Heh. Worst. Timing. Ever.
About an hour ago, a bunch of the engineers responsible for keeping Google alive sat down to answer questions on reddit.
Know what else happened about an hour ago? Gmail and Google+ went down around the world.
According to their previous AMA from a year ago, the team (which Google calls the ‘Site Reliability’ team, or SRE) is “responsible for the 24×7 operation of Google.com, as well as the technical infrastructure behind many other Google products such as GMail, Maps, G+ and other stuff you know and love.”
A coincidence, almost certainly. But a pretty damn funny one. Only four members of the reliability team took part in the AMA, and you can be damned sure that Google employs more than four people to keep their many millions of servers from catching on fire. As you might expect, the very top comment in the post (and dozens of others down the page) pokes fun at the unfortunate timing.
Impressively, the team didn’t seem to break much of a sweat. Each member kept contributing answers to the thread, and service was restored within 50 minutes, with many users reporting that things were back within the hour. This question from the AMA gives a bit of insight as to how that could be:
Reddit user notcaffeinefree asks: “Sooo….what’s it like there when a Google service goes down? How much freaking out is done?”
Google’s Dave O’Connor responds:
Very little freaking out actually, we have a well-oiled process for this that all services use – we use thoroughly documented incident management procedures, so people understand their role explicitly and can act very quickly. We also exercise these processes regularly as part of our DiRT testing.
Running regular service-specific drills is also a big part of making sure that once something goes wrong, we’re straight on it.
Looks like that process is, indeed, pretty well-oiled (though one David S. Peck probably isn’t too impressed).