Why Gmail Failed Today

When Gmail went down today, it caused more than a minor panic. People like me, who use Gmail as their primary email, couldn’t get much work done. There’s nothing like an outage to make you realize how much you rely on something.

So what happened exactly? Isn’t Gmail supposed to have no single point of failure? Well, yes. Gmail has thousands and thousands of overlapping mail servers, which can pick up the slack if any one fails because the data is replicated and spread all around. But there are also request servers, which do nothing but route requests for email to whichever server (with the right emails on it) happens to be available.

It turns out that Google took down some regular email servers for routine maintenance and, because of some recent changes, that overloaded the request servers. Google engineering VP Ben Treynor explains on the Gmail Blog:

At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.

So much for redundancy.
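
To see why a handful of overloaded routers can take down nearly all of them, here is a toy simulation of the chain reaction Treynor describes. It is only a sketch and says nothing about how Google's routers really work: the function name, the capacity numbers, and the assumption that traffic is split evenly among whichever routers are still up are all made up for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading overload (hypothetical, not Google's design).

    capacities: max load each router can handle (made-up units).
    total_load: traffic that must be split evenly across the active routers.
    Returns the number of active routers after each round.
    """
    active = list(capacities)
    rounds = [len(active)]
    while active:
        # Assumption: load is shared evenly by the routers still accepting traffic.
        share = total_load / len(active)
        # A router whose share exceeds its capacity says "stop sending us traffic".
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # Everyone can cope, so the system is stable.
        active = survivors
        rounds.append(len(active))
    return rounds


# Ten hypothetical routers with a combined capacity of 1,200 units.
capacities = [100, 100, 110, 110, 120, 120, 130, 130, 140, 140]

print(simulate_cascade(capacities, 1000))  # stable: all ten routers stay up
print(simulate_cascade(capacities, 1050))  # slightly more load: the pool collapses
```

In this made-up setup, the routers together could handle 1,200 units, yet a load of 1,050 still knocks out every one of them. The two weakest drop out first, their traffic lands on the rest, and each round pushes more routers over the edge. Spare capacity on paper doesn't help if the load gets redistributed faster than the survivors can absorb it.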

Gmail, which recently passed AOL to become the third-largest Web mail service in the U.S., is obviously having some growing pains. A few hours of downtime is not the end of the world, although it might seem like it at the time. Google just better not make this a habit.