Tony Howlett

Mar 21, 2017

Did A Single Keystroke Take Down Amazon?

The current readout on the eastern-region outage of Amazon's popular AWS cloud service is that it was caused by a single technician's errant command. The four-hour outage caused downtime for many large websites and apps that depend on the service, including Netflix, Slack, and the SEC, and it made headlines in all the major news services. And while the popular view holds that the root cause was a techie's typo, the typo was not the ultimate underlying problem. Sure, the sequence of events was set in motion by a single command. But if your entire network (or, in this case, an entire region) can be taken out by such a small action, then I would lay the blame on two things: bad or missing administrative controls, and deficient disaster recovery testing.

“Are You Sure?”

The tech responsible for this disaster is probably in the unemployment line, or at the very least facing a limited career future at Amazon. But this should not be the case. He (or she) made a mistake. This is HUMAN. If that person made the same mistake routinely, then I would say the issue is a bad or incompetent employee. But to blame a single person, or even the poor single command, is to scapegoat and avoid laying the blame where it really lies.

Proper design and controls should have been in place to prevent any one command from causing such damage, especially at an organization that prides itself on uptime and provides a mission-critical service to other companies. Even the Windows desktop OS warns you before you delete a swath of files with that annoying "Are you sure?" prompt. Why no equivalent safeguard existed here should be the focus of Amazon's post-incident review, not why someone made a normal human mistake in typing a command.
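As a rough sketch of the kind of guardrail described above (the function names, thresholds, and fleet model here are all hypothetical, not Amazon's actual tooling), a destructive command can be wrapped so that it both limits its blast radius and demands explicit confirmation:

```python
# Hypothetical guardrail around a destructive fleet operation.
# All names and the 5% threshold are illustrative assumptions.

MAX_REMOVAL_FRACTION = 0.05  # never remove more than 5% of capacity at once


def remove_servers(requested, fleet, confirm):
    """Remove servers from the fleet, refusing dangerous requests.

    requested -- list of server IDs the operator asked to remove
    fleet     -- set of all server IDs currently in service (mutated)
    confirm   -- callback returning True only on explicit operator approval
    """
    targets = [s for s in requested if s in fleet]
    if not targets:
        return []
    fraction = len(targets) / len(fleet)
    if fraction > MAX_REMOVAL_FRACTION:
        # A typo that balloons the request now fails loudly
        # instead of silently executing.
        raise RuntimeError(
            f"Refusing to remove {fraction:.0%} of the fleet "
            f"(limit is {MAX_REMOVAL_FRACTION:.0%}); use smaller batches."
        )
    # The "Are you sure?" step: show exactly what will happen before doing it.
    if not confirm(f"Remove {len(targets)} of {len(fleet)} servers? [y/N] "):
        return []
    for server in targets:
        fleet.discard(server)
    return targets
```

With a wrapper like this, a mistyped argument that expands the request from a handful of servers to a large slice of the region is rejected before any capacity is touched.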

Planning – and Testing – For Disaster Recovery

Secondarily, the outage was made worse by a poorly tested disaster recovery plan. Apparently the region had not been fully rebooted in years, and when this was finally done, things did not recover as expected. For an organization that should be rehearsing exactly these kinds of events with recovery tests, it is to Amazon's discredit that it did not. Even simple, smaller-scale DR tests might have surfaced many of the issues that slowed down the real recovery. That is why we do DR testing: to hit the unknown problems in a controlled (and hopefully non-critical) environment. Amazon did not do this testing, or did not do it well enough, and the outage exposed that failure.
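The structure of even the smallest DR drill is the same at any scale. A minimal sketch (using hypothetical in-memory stand-ins for a real datastore and its backup; the point is the drill's back-up / break / restore / verify shape, not the storage):

```python
# Minimal DR drill sketch. The dict-based "datastore" is a stand-in;
# a real drill would exercise real services in a non-critical environment.

def take_backup(store):
    """Step 1: snapshot the live store."""
    return dict(store)


def restore(backup):
    """Step 3: rebuild the store from the last snapshot."""
    return dict(backup)


def dr_drill():
    """Exercise the full fail-and-recover path in a controlled setting."""
    live = {"orders": 42, "users": 7}   # known-good state
    backup = take_backup(live)
    live.clear()                         # step 2: simulate total loss
    recovered = restore(backup)
    # Step 4: verify the recovery actually worked -- the step that only
    # a real drill, not a paper plan, can confirm.
    if recovered != {"orders": 42, "users": 7}:
        raise RuntimeError("recovery produced wrong state; drill failed")
    return recovered
```

The verification step is the whole reason to run the drill: a plan that has never been executed end to end is where the "unknown issues" hide.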

Hopefully Amazon is learning from this event rather than scapegoating the unlucky individual fate chose to expose this glaring omission. For those of you who might have similar issues hidden in your own infrastructure (probably all of us), treat this as a lesson in how not to design and test critical infrastructure.
