When we heard about Instagram (and other sites) going down after a storm hit Amazon Web Services’ North Virginia hub (not the first time AWS has gone down; April 2011 saw another notable outage), we couldn’t help but wonder: could it have been avoided?
Mike Krieger, one of the founders of Instagram, once presented a great slideshow describing how Instagram was able to scale up so well. “The cleanest solution with the fewest moving parts as possible,” has been one of the guiding principles for the photo-sharing app, bought by Facebook in a billion-dollar deal earlier this year. Could that too-simple architecture have played a role here?
We’ve reached out to the Twitterverse and beyond to get some thoughts on that.
Disclaimer: Without knowing the exact ins and outs of Instagram’s architecture, it’s hard to say why Instagram and other services, like TechCrunch’s database CrunchBase.com, are still down while other sites that had been affected, like Pinterest, Netflix and Heroku, appear to have started working again (although some say they’re still having problems).
Dominik Tobschall, co-founder and CEO of Fruux, a cloud-based contacts startup in Münster, Germany, notes that there are ways to run a service so that it doesn’t hinge on the health of one physical data center, but that the bigger the service, the harder that becomes:
“Since everything that can go wrong always will go wrong with technology, it’s important to deploy applications in a way where you always have in mind ‘if power or connectivity or anything else fails in an availability zone, all servers in that zone might power down/be disconnected’ and ‘if a comet hits the datacenter, there is a huge earthquake or whatever, all servers in that region might power down/get disconnected.’
“The only way to protect an application from downtimes is to run machines in multiple availability zones; additionally run machines in different regions; have a good automatic (or quick manual) failover methodology in place. But it’s incredibly hard and the bigger a service is the harder it gets.”
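The multi-region failover Tobschall describes can be sketched in a few lines. This is an illustrative toy, not Instagram’s or Fruux’s actual setup: the region names are real AWS regions, but the health map and `pick_region` helper are hypothetical stand-ins for real health probes (e.g. an HTTP check against each region’s load balancer).

```python
# Hypothetical sketch: route traffic to the first healthy region.
# In production the `health` map would be fed by automated probes.

PRIMARY = "us-east-1"                 # North Virginia, the region hit by the storm
FALLBACKS = ["us-west-2", "eu-west-1"]

def pick_region(health):
    """Return the primary region if healthy, else the first healthy fallback.

    `health` maps region name -> bool (True = region is serving traffic).
    Raises RuntimeError if every region is down.
    """
    for region in [PRIMARY] + FALLBACKS:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")

# Simulate the outage scenario: North Virginia loses power.
status = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
print(pick_region(status))  # -> us-west-2
```

The hard part, as Tobschall says, isn’t the routing logic but keeping the fallback regions’ data in sync so that a failover is actually usable.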
Reader Nicholas James made a similar point. In his view, there is a trade-off between latency (distributing your cloud service across multiple regions means that if one goes down others can pick up the load, but at the cost of higher latency) and the ease of replicating a database (which gets harder the more regions you span).
With a service like Instagram, reliant on a worldwide network of users uploading thousands of images (of food and more) every day, it may be that this kind of replication is impossible.
Aaron Levie, CEO of cloud services company Box, notes that the simplicity of Amazon’s infrastructure-as-a-service model is compelling but also takes a lot of control out of a company’s hands:
“At the end of the day, the cloud’s availability will come down to its physical infrastructure being available — it looks like Amazon’s data center in Virginia experienced a power failure, which knocked out a number of its systems there. For the applications built on top of Amazon, sometimes negative consequences from these events can cascade through your infrastructure (e.g. when one service goes down, it then overloads another service that was otherwise fine), and in other cases some apps just don’t have resilience for these events built into their software.
“AWS doesn’t necessarily promise to handle these situations gracefully for you; because it’s a provider of infrastructure as a service, you get pretty low-level access to the technology (vs. making it super abstracted). That comes with huge benefits, but equally has consequences if the infrastructure disappears. That said, AWS has a pretty great track-record for uptime, but of course given their popularity, when they hit a snag the entire internet notices. At Box, we don’t use AWS for any primary infrastructure, and we run out of a number of our own datacenters to ensure fault tolerance in the event of a physical system experiencing issues, so that helps.”
“That’s the nature of relying on someone else for your website storage or application hosting. If your host goes down, so do you. Although AWS doesn’t go down too often, it might be prudent to have a backup that’s not based on AWS.
“The main selling point for AWS is that it’s cheap. Wicked cheap. It allows the little guy to compete with the big boys. Even a simple colocated server will cost upwards of $300 USD/month for a good one. AWS lets you have your data in more places at once a la carte, so you don’t have to pay for what your’e not using. It allows you to scale your app/website without worrying about infrastructure.”
Vineet Thanedar, another one of our IT heroes, tells me that CrunchBase’s hosting is managed by EngineYard (which runs on AWS). “While AWS is back up, Engine Yard is still bringing up all their instances across clients and fixing issues. Engine Yard has thousands of customers.”
Barry Nolan, the CEO and co-founder of in-app messaging specialist Converser, pointed me to a great note from the Twilio engineering blog that explains why Twilio, which also runs using AWS, was not affected during a previous outage.
It’s a technical post but is full of examples of how you can architecture a service so that downtime in one place doesn’t bring the whole thing to a crashing halt.
So that people can get on with eating their meals and drinking their lattes.
[Image: Aussiegall, Flickr]