The day Amazon S3 storage stood still

By now you’ve probably heard that Amazon’s S3 storage service went down in its Northern Virginia datacenter for the better part of four hours yesterday, taking parts of a number of prominent websites and services with it.

It’s worth noting that as of this morning, the Amazon dashboard was showing everything was operating normally.

While yesterday’s outage was a big deal for those affected, it’s important to remember that S3 has proven remarkably dependable over the years, and while the Northern Virginia datacenter was down, S3 remained up in its 13 other regions.

To provide some additional perspective, CloudHarmony, a service that tracks cloud outages, reports that S3 has typically exceeded its Service Level Agreement (SLA), which promises that the service will be up 99.9 percent of the time and offers refunds for those times when it’s not. CloudHarmony found that in most cases S3 has achieved 100 percent annual availability since the company began monitoring cloud services in 2014. The notable exception was an S3 outage in August 2015.
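To put the 99.9 percent figure in concrete terms, the short Python sketch below converts an uptime percentage into an allowance of downtime. The measurement windows (a 30-day month and a 365-day year), the round four-hour outage length, and the function names are all assumptions made for illustration; nothing above says over what period the SLA is measured.

```python
# Assumed measurement windows; the SLA's actual window is not stated here.
HOURS_PER_MONTH = 30 * 24   # 720
HOURS_PER_YEAR = 365 * 24   # 8760


def allowed_downtime_hours(uptime_percent, window_hours):
    """Hours of downtime a given uptime percentage permits in a window."""
    return window_hours * (1 - uptime_percent / 100)


def achieved_uptime_percent(outage_hours, window_hours):
    """Uptime percentage for a window containing outage_hours of downtime."""
    return 100 * (1 - outage_hours / window_hours)


OUTAGE_HOURS = 4  # assumed round figure for "the better part of 4 hours"

for label, window in (("month", HOURS_PER_MONTH), ("year", HOURS_PER_YEAR)):
    budget = allowed_downtime_hours(99.9, window)
    uptime = achieved_uptime_percent(OUTAGE_HOURS, window)
    print(f"Per {label}: {budget:.2f} hours allowed, "
          f"{uptime:.3f} percent uptime after a {OUTAGE_HOURS}-hour outage")
```

Under those assumptions, a 99.9 percent promise allows roughly 43 minutes of downtime in a 30-day month, or a little under nine hours in a year. A single four-hour outage would miss the target if the promise is measured monthly and still clear it if measured annually.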

CloudHarmony points out that it tracked a similar outage on Microsoft Azure virtual machines and object storage on February 19th that lasted over five hours, but didn’t get nearly the attention that yesterday’s incident did.

Ben Kepes, a cloud computing analyst and commentator, says folks at AWS headquarters in Seattle probably didn’t sleep too well last night, but these types of outages are bound to happen from time to time. “AWS is the biggest public cloud vendor by many orders of magnitude. As such, any outage is highly impactful across the market. If anything, the outage showed just how many third parties rely on AWS for their infrastructure. The reality, as unpalatable as it sounds, is that failures happen from time to time and organizations need to plan for that failure,” he said.

Kepes added that outages happen everywhere, as any IT pro knows. They just aren’t usually as high profile as when they happen to a popular vendor like AWS. “While people will wring their hands about this, the fact is that outages happen with every flavor of infrastructure,” he said.

Dave Bartoletti, a Forrester analyst who covers the cloud industry, agrees and says it’s a wake-up call for services to bake redundancy into their storage, regardless of where it is. “You can store your data in multiple regions or you could build a site that doesn’t rely [solely] on S3 as its only storage,” he explained.
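The kind of redundancy Bartoletti describes can be sketched in a few lines of Python: try one copy of the data, and fall back to the next if it is unavailable. This is a minimal illustration, not any vendor’s API. The function names, the store labels, and the in-memory stand-ins for real storage regions are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Fetcher = Callable[[str], bytes]


class AllStoresFailed(Exception):
    """Raised when no configured store could return the object."""


def fetch_with_fallback(key: str,
                        stores: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each (name, fetch) pair in order; return the first success."""
    errors = []
    for name, fetch in stores:
        try:
            return name, fetch(key)
        except Exception as exc:  # broad on purpose: any failure means "try the next copy"
            errors.append(f"{name}: {exc}")
    raise AllStoresFailed(f"could not read {key!r}: " + "; ".join(errors))


# Hypothetical stand-ins for two copies of the same data.
def primary_region(key: str) -> bytes:
    raise ConnectionError("region unavailable")  # simulate the outage


REPLICA_DATA = {"logo.png": b"fake image bytes"}


def replica_region(key: str) -> bytes:
    return REPLICA_DATA[key]


source, data = fetch_with_fallback(
    "logo.png",
    [("primary", primary_region), ("replica", replica_region)],
)
print(f"served from {source}")
```

In this toy run the primary copy fails and the request is served from the replica. A real site would also have to keep the copies in sync, which is where most of the cost and complexity lies.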

These analysts aren’t blaming the victim when they say this. This is the kind of redundancy that IT pros have been building into their systems for years. What they’re saying is that the cloud really isn’t any different.

But analyst Patrick Moorhead of Moor Insights & Strategy wasn’t ready to be so forgiving, saying that the outage could cost millions of dollars in downtime when all was said and done, and that Amazon has to do a better job of setting up those redundant systems on behalf of customers.

“Just because this is part of the public cloud, I don’t think we should expect things like this to happen. This rarely happens in industries like banking because fault tolerance is built into the architecture, not a bolt-on,” he said.

While that may or may not be a fair criticism, outages are a fact of life regardless of the type of infrastructure you have, whether an in-house data center or cloud services like AWS. No approach is foolproof. That doesn’t mean AWS gets off the hook for this, but if you look at its track record, it probably deserves the benefit of the doubt.