Facebook Gives A Post-Mortem On Worst Downtime In Four Years

Jason Kincaid

Jason Kincaid worked as a writer for TechCrunch from April 2008 through 2012. He grew up in Danville, California and later moved to Los Angeles, California to attend UCLA, where he studied biology with a minor in ‘Society and Genetics’. You can reach him at jkincaid@gmail.com

Thursday, September 23rd, 2010

Facebook’s had a rough day. In fact, it’s had its worst day performance-wise in over four years, with 2.5 hours of downtime that resulted in countless complaints from users. Perhaps more important, it also had a bevy of API problems, and its Like buttons, which are embedded on over 350,000 sites across the web, were apparently busted too. When Facebook goes down, it’s a big deal.

This evening, Facebook Director of Software Engineering Robert Johnson published a post-mortem of the outage explaining what caused the site to fail.

According to Johnson’s post, the problem stemmed from an automated system Facebook had built to check for invalid configuration values in its cache. Unfortunately, that automated check backfired — to the point that Facebook had to turn off the site entirely to recover. Here’s a portion of the explanation:

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
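To make the feedback loop in Johnson’s explanation concrete, here is a minimal toy simulation in Python. It is a sketch under loud assumptions, not Facebook’s actual system: the function name `simulate`, the single cached value, the per-tick `db_capacity`, and the `delete_on_error` switch are all hypothetical, and the model assumes the persistent copy of the value has already been corrected, so any successful query would repopulate the cache with a good value.

```python
def simulate(traffic, db_capacity, delete_on_error=True):
    """Toy model of the feedback loop described in the post-mortem.

    traffic         -- number of clients arriving on each tick (hypothetical)
    db_capacity     -- queries the database cluster can serve per tick (hypothetical)
    delete_on_error -- if True, a failed query is treated as an invalid value
                       and the cache key is deleted, as the post describes
    Returns the number of database queries issued on each tick.
    """
    cache_valid = False  # the bad configuration value has just been seen
    queries_per_tick = []
    for clients in traffic:
        # Every client that sees an invalid cached value goes to the database.
        queries = 0 if cache_valid else clients
        queries_per_tick.append(queries)
        if queries == 0:
            continue  # nothing reached the database; cache state is unchanged
        succeeded = min(queries, db_capacity)
        failed = queries - succeeded
        if failed and delete_on_error:
            # Errors delete the key, so the next tick sees an invalid value again.
            cache_valid = False
        elif succeeded:
            # A served query repopulates the cache with the corrected value.
            cache_valid = True
    return queries_per_tick


if __name__ == "__main__":
    # Overload never clears: each tick's failures cause the next tick's queries.
    print(simulate([1000, 1000, 1000, 1000], db_capacity=100))
    # Stopping traffic, then letting a few clients back in, breaks the loop.
    print(simulate([1000, 1000, 0, 50, 1000], db_capacity=100))
```

In this toy model the load stays pinned at full traffic for as long as demand exceeds capacity, and it only clears once traffic is cut to zero and then readmitted below capacity, which mirrors the recovery the post describes. Passing `delete_on_error=False` shows, within this simplified model only, that the loop depends on errors being treated as invalid values.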

Facebook has generally had a good track record in terms of keeping its homepage alive, but I’ve heard repeated complaints about the integrity of its API. And given Facebook’s goal of becoming the social fabric of the web, which entails maintaining a presence on countless third-party sites, it’s imperative that it keeps its various widgets and authentication buttons working properly.

