AWS took a lot of heat when its S3 storage component went down for several hours on Tuesday, and rightly so, but today the company published a post-mortem explaining exactly what happened, complete with technical details, and how it plans to prevent a similar event from occurring.
At the core of the problem was, unsurprisingly, human error. Some poor engineer, we’ll call him Joe, was tasked with entering a command to shut down some storage sub-systems. On a typical day this doesn’t cause any issue whatsoever. It’s a routine kind of task, but on Tuesday something went terribly wrong.
Joe was an authorized user, and he entered the command according to procedure, based on what Amazon calls “an established playbook.” The problem was that Joe was supposed to issue a command to take down a small number of servers on an S3 sub-system, but he made a mistake and took down a much larger set instead.
In layman’s terms, that’s when all hell broke loose.
Amazon explains it much more technically, but suffice it to say the error had a cascading impact on S3 storage in the Northern Virginia datacenter. To make a long story short, Joe’s error took down some crucial underlying sub-systems, which removed a significant amount of storage capacity, which in turn forced those systems to restart. While this happened, S3 couldn’t service requests, which caused even AWS’s own status dashboard to go down (which is, you know, kind of embarrassing).
By now, the outside world had started to feel the impact, and your favorite websites, apps and cloud services were beginning to behave in a wonky fashion.
As the afternoon wore on, the company worked feverishly to get the service back online, but the size of the systems was working against it. Restarting these sub-systems, something AWS says it hasn’t had to do in many years, made S3 a victim of its own success: capacity had grown to such an extent in the affected datacenter that running all of the safety checks and validating the integrity of the underlying metadata took quite a bit longer than expected.
To reduce the prospect of a similar human error in the future, the company is making some changes. In their words, “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.” That should prevent someone like Joe from making a similar mistake in the future.
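AWS hasn’t published the tool’s code, so every name and number below is hypothetical, but the safeguard described in the quote can be sketched roughly like this: cap how much capacity one command can remove at a time, and refuse any removal that would drop a subsystem below its minimum required capacity.

```python
# Illustrative sketch only; MIN_REQUIRED, MAX_BATCH and the function name
# are invented for this example, not AWS's actual tooling.

MIN_REQUIRED = 100   # hypothetical floor a subsystem must never drop below
MAX_BATCH = 5        # hypothetical cap: remove capacity slowly, in small steps

def remove_capacity(active_servers, requested):
    """Remove at most `requested` servers, honoring both safeguards."""
    removable = len(active_servers) - MIN_REQUIRED
    if removable <= 0:
        raise RuntimeError("removal would breach minimum required capacity")
    # Never remove more than MAX_BATCH at once, and never past the floor.
    count = min(requested, MAX_BATCH, removable)
    return active_servers[count:], active_servers[:count]

servers = [f"srv-{i}" for i in range(110)]
# A fat-fingered request for 50 servers only takes out 5, and the floor holds.
remaining, removed = remove_capacity(servers, requested=50)
print(len(remaining), len(removed))  # 105 5
```

With a check like this in place, a mistyped command in Joe’s position would have been throttled to a small batch rather than taking out a large swath of the fleet at once.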
In addition, AWS is looking at ways to break down those S3 sub-systems, which were core to the problem, into much smaller pieces or cells, as they call them, something they have tried to do in the past. Obviously, the sub-systems proved too large to recover quickly (or at least quickly enough).
They close with an apology and a promise to do better. In the end, it was a combination of factors that caused the issue, starting with a human error and then cascading across systems that hadn’t been designed to deal with an error of this magnitude.