Oops! A discussion about priorities and risk
This post is about a case where we didn’t follow our own advice or industry best practices and it bit us. But then other interesting things ensued and we learned some things.
The Gotcha Moment
Last week, I wrote a blog post about AppSec Programs that included a live Trello board and exposed a fair amount of the inner workings of how I think about AppSec. I was excited about the post and the concrete template to work from. It won’t get you all the way there, but it is at least a decent reference point.
Well, after I pushed the post, I was naturally logged in to our WordPress and saw that some updates needed to be applied. Like any good security person, I don’t like to let updates sit around unapplied, so I went ahead and applied them.
But unlike a good security (or devops) person, I didn’t follow any kind of change control process, didn’t test in a dev or staging environment, didn’t even snapshot the instance.
So … what I’m saying is that a week ago when this happened, we didn’t have any non-production version of our website, jemurai.com, and even worse, we didn’t really have a way to get one back if we needed to.
Well, one of the updates I installed was a theme update that introduced a bunch of new content areas with default content. Think “Lorem Ipsum” all over the site with stock photography of people doing who knows what. I was mortified.
I mean, one cool thing about running your own business is that you can actually write this post … but at that time, I was pretty focused on how unprofessional it was and how that hits real close to home.
What We Did
Well, when I first saw what had happened I looked for backups and other ways to revert the change. Alas, we were not using a WordPress.com versioned theme, so it wasn’t going to be as easy as simply rolling it back. As I mentioned, we didn’t have another backup mechanism in place other than raw AWS snapshots, and none of those were recent enough to be a great option.
The reality is that we had wanted to redo the website completely using GitHub Pages for some time. I’d used that technology for years, just not on the company site. It’s not like we were really leveraging WordPress anyway. We’d even had the cert expire a few times, embarrassingly harkening back to a much earlier stage in my career where we built tools to monitor for that.
So we flipped a switch, exported the blog posts, and started a new website with GitHub Pages, based on Jekyll and a theme we had used for a few sites we run. Luckily, the blog post itself didn’t look that bad - but anyone who went to the main home page would have been a bit confused by the generic text and stock photos.
It took about 3-4 hours to have something that was good enough to push, so at that point we flipped DNS and continued with minor updates. The next morning we had some links to fix, and we’re still migrating older blog posts - though most of that was also automated.
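For the curious, the automated part looked roughly like the sketch below. It is a minimal illustration rather than our exact tooling: it assumes a standard WordPress export (WXR) file, and the `wordpress-export.xml` filename and `_posts` output directory are just placeholders.

```python
# Minimal sketch: convert a WordPress export (WXR/XML) into Jekyll posts.
# Illustrative only - a real migration also has to deal with images,
# shortcodes, and redirects.
import os
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespaces used in WordPress export files (export version 1.2 assumed here).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}

def slugify(title):
    """Lowercase the title and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def export_to_jekyll(wxr_path, out_dir="_posts"):
    os.makedirs(out_dir, exist_ok=True)
    tree = ET.parse(wxr_path)
    for item in tree.getroot().iter("item"):
        # Only migrate published posts, not drafts or attachments.
        if item.findtext("wp:status", default="", namespaces=NS) != "publish":
            continue
        title = item.findtext("title", default="Untitled")
        date_str = item.findtext("wp:post_date", default="1970-01-01 00:00:00", namespaces=NS)
        date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
        body = item.findtext("content:encoded", default="", namespaces=NS)
        # Jekyll expects posts named YYYY-MM-DD-slug.md with YAML front matter.
        filename = f"{date:%Y-%m-%d}-{slugify(title)}.md"
        with open(os.path.join(out_dir, filename), "w") as f:
            f.write("---\n")
            f.write(f'title: "{title}"\n')
            f.write(f"date: {date:%Y-%m-%d}\n")
            f.write("layout: post\n")
            f.write("---\n\n")
            f.write(body)

if __name__ == "__main__":
    export_to_jekyll("wordpress-export.xml")
```

The front matter keys shown (title, date, layout) are about the minimum a Jekyll theme needs to render a post; the rest of the cleanup was manual.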
Meanwhile, behind the scenes we were still scrambling a little bit. There was nothing quite like a question in the #appsec-program channel of the OWASP Slack that went something like this:
Hey Matt, I can’t find this page that is referenced in your post. What’s up?
Oh, hold on, let me just … copy that old post I referenced into the new site that is on an entirely different platform than the one the original reader referenced.
Interestingly, Pingdom reported zero downtime. Might make for an interesting discussion about how much you need to be able to see to know things are ok, and why we think securitysignal.io is so interesting.
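To make that concrete, here is a tiny sketch of the gap (purely illustrative - this is not how Pingdom or securitysignal.io work): a status-only uptime check confirms the page responds, but it takes a content check to notice that the page is suddenly full of someone else’s lorem ipsum.

```python
# Illustrative only: why "zero downtime" and "the site is fine" are not the
# same thing. A status-only check was happy; a content check would not have been.
from urllib.request import urlopen

def check_site(url, expected_text):
    """Return (is_up, content_ok) for a single check of the given URL."""
    with urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        is_up = resp.status == 200           # all a basic uptime monitor sees
        content_ok = expected_text in body   # what it takes to catch our mistake
    return is_up, content_ok

if __name__ == "__main__":
    # "Jemurai" is just a marker we would expect on our own home page.
    up, ok = check_site("https://jemurai.com", "Jemurai")
    print(f"up={up} content_ok={ok}")
```

In our case the home page kept returning 200 with the theme’s default text, so a status-only check had nothing to complain about.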
The fact is, I think the website is a little better now. I also think we have better automation around the certificates (since we never have to touch them), better collaboration with pull requests across the team, and better backups. It is also nice, especially in security, to have a static website with no PHP or database code. I’m writing this post in markdown in my favorite editor instead of some WYSIWYG WordPress editor. I can do all that perfectly well offline.
But there is a whole other angle I want to bring up with this scenario.
Business Continuity and Risk
We had been telling our customers to keep backups, use change control, define RPO and RTO (recovery point objective, how much data you can afford to lose, and recovery time objective, how long you can afford to be down) and test against them. But we didn’t do it ourselves for our website.
Seems embarrassing. But let’s step back for a minute and talk about the actual risks here.
When I write a blog post, we get a little traffic, but interruption of that traffic isn’t a real event. It’s mostly AppSec people who are curious to improve their craft. They’re not buying from us, and if they are, they’re not worried about the website.
I’m not saying the website doesn’t matter at all, but there are pros and cons to having more controls in place to manage the uptime. Specifically:
Pros to having more controls in place:
- Probably avoid downtime
- Lower risk of losing a potential customer
Cons to having more controls in place:
- Can’t just go whip off a post
- Lower risk of gaining a customer (because we don’t post)
- Have to maintain two environments
- Have to pay for two environments
- Twice the attack surface
Now, I have enough trouble blogging regularly to begin with - I don’t need to get someone’s approval to make it even harder!
Most businesses assume that continuity is critical. For some applications it really is. For our customer-facing applications, we have backups, redundancy, and change control processes in place. “Real talk”: it is still probably not critical. Downtime there would be an inconvenience, and we have come to expect little inconvenience.
When we, as a security community, treat the website with little to no material risk the same as a business system that for some reason needs a very high SLA, it’s like we’re taking a one-size-fits-all approach to security rules and advice, and we’re not credible.
Consider:
According to Gartner, the average cost of IT downtime is $5,600 per minute. Because there are so many differences in how businesses operate, downtime, at the low end, can be as much as $140,000 per hour, $300,000 per hour on average, and as much as $540,000 per hour at the higher end. (https://www.the20.com/blog/the-cost-of-it-downtime/)
I would argue a lot of business systems aren’t really business critical. Of course, it is pretty easy to build resilient systems in this day and age … so there’s not a lot of excuse not to when it matters, but …
Anyway, no excuses - it was a failure on my part even if I can Monday morning quarterback it to be just a learning experience.
Failures
Speaking of failures, I think it might also be appropriate to take a moment to talk about failures.
I have been hesitant to post this blog post. What if people decide that I am incompetent because I didn’t back up WordPress before updating? Or worse, careless?
The truth is, we have all made mistakes. It’s more dangerous not to talk about them and learn from them.
Like any experienced engineering leader, I’ve made mistakes before. Like the time I forgot a where clause when deleting out of a table in a production Oracle database and ended up sitting up all night with the DBAs while they restored from tape.
delete from whatever_table_it_was -- where id=13
That was a big mistake.
I’ve made innumerable smaller mistakes in intricate code. I like to ask people when we give training how many security bugs they think they’ve introduced into systems they are building. My answer is thousands or more, I’m sure.
A big part of how we grow up in security is how we handle failures. Can we step back, learn, and do better? A lot of that is cultural. It’s something you can build organizationally, but it’s not something you can just get or buy or manufacture. It takes work and trust. It takes confidence and resilience.
So part of the reason I’m writing this post is to let my team and anyone else who is interested know that I make mistakes. What I want them to notice is how I respond and what happens next.
Conclusion
As much as I can explain away this event, it was eye-opening for me.
It is always better to have made conscious risk decisions than to have been lulled into suboptimal ones without realizing it.
I’m a little embarrassed about it. Everything else we build is engineered - it needs to be robust - it needs varying degrees of failover, redundancy, etc. There’s no reason the website should be an exception.
So I’m refreshing our threat model and keeping continuity as a focus through the process. We’re getting better. I don’t know a whole lot, but I’m pretty sure we’ll always be getting better - which means we will also always be making mistakes.