Google Explains Its Google Docs Outage

Google Docs suffered an extended outage this week, which raised concerns, yet again, about the reliability of storing mission-critical documents in the cloud. Personally, I’d rather trust Google’s redundant server infrastructure than my own hard drive. For enterprise users, however, the trouble with cloud outages is that local I.T. staff can’t do anything to fix them; at best, they can fall back on something like a third-party backup service.

Today, Google is sharing details on what happened to its Docs service and what it’s doing to prevent similar problems in the future.

According to a post on the Google Enterprise Blog, the outage was caused by a change designed to improve real-time collaboration within the document list. This change exposed a memory management bug that was only evident under heavy usage.

Writes Alan Warren, Engineering Director:

Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines – making them run out of memory even faster. This meant that eventually the servers couldn’t properly process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday. 
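To see why the failure sped up as machines dropped out, here is a toy simulation of the feedback loop Warren describes. It is only a sketch under made-up assumptions, not Google's actual system: the function name, the machine count, the memory capacity, the per-request leak, and the restart time are all hypothetical numbers chosen for illustration.

```python
def simulate_cascade(num_machines=4, requests_per_tick=40, capacity=100.0,
                     leak_per_request=1.0, restart_ticks=5, ticks=20):
    """Toy model of lookup machines that leak memory on every request.

    Returns a list with the number of machines running at each tick.
    All parameters are hypothetical, chosen only for illustration.
    """
    # Stagger the starting memory so the machines do not fail in lockstep.
    used = [10.0 * i for i in range(num_machines)]
    down = [0] * num_machines  # ticks remaining until a restart completes
    running_history = []

    for _ in range(ticks):
        running = [i for i in range(num_machines) if down[i] == 0]
        running_history.append(len(running))

        # Restarting machines get one tick closer to coming back.
        for i in range(num_machines):
            if down[i] > 0:
                down[i] -= 1

        if not running:
            continue  # nobody is available to serve requests this tick

        # The same total load is split across whichever machines are still up,
        # so each survivor handles more requests and leaks memory faster.
        share = requests_per_tick / len(running)
        for i in running:
            used[i] += share * leak_per_request
            if used[i] >= capacity:
                used[i] = 0.0            # memory is cleared by the restart
                down[i] = restart_ticks  # but the machine is offline meanwhile

    return running_history


if __name__ == "__main__":
    print(simulate_cascade())
```

With these invented defaults, all four machines stay up for the first seven ticks. Once the first one restarts, the others follow within the next two ticks, because each survivor inherits a larger share of the load.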

The entire outage lasted around 30 minutes: 24 minutes to roll back the changes, and 5 more for the service to fully return to normal.

According to Warren, analysis of the issue has enabled Google to reduce the chances of future events, decrease resolution times if such an event were to occur again, and limit the scope that any single problem can affect.

For most casual users of Google Docs, the outage probably went unnoticed. It’s the affected Google Apps business users who are most concerned about cloud outages like this one. Transitioning to the cloud is not without its faults, but let’s remember: no system is perfect, not even the one your I.T. guy used to run for you.