Foursquare Explains Yesterday's 11 Hour Outage: An Overloading Of Database Shards

MG Siegler


Tuesday, October 5th, 2010

As you may have noticed yesterday, Foursquare was down. Very down. In fact, the service says that total downtime was around 11 hours. That’s not good. And they know it. So they wrote a post on their blog today apologizing, explaining what happened, and laying out what they’ll do differently to prevent it from happening again.

The “what happened” part is fairly technical. But basically it boils down to this: Foursquare data is supposed to be spread evenly over different database “shards” (think of them as segments of the database). At some point yesterday morning, things got uneven, with one shard getting way more data than the others. They tried to balance it out, but that didn’t work, so they tried putting a new shard in play. Then all hell broke loose.
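For the curious, here is a rough sketch of how a shard can end up carrying far more than its share. It is purely illustrative: the range-based placement by user ID, the function names, and the numbers are all assumptions made for the example, not a description of how Foursquare actually splits its data.

```python
def shard_for(user_id, num_shards, users_per_shard):
    """Hypothetical range-based placement: contiguous blocks of user IDs
    live on the same shard (an assumption for this sketch)."""
    return min(user_id // users_per_shard, num_shards - 1)


def shard_loads(checkins, num_shards, users_per_shard):
    """Count how many check-in records land on each shard."""
    loads = [0] * num_shards
    for user_id in checkins:
        loads[shard_for(user_id, num_shards, users_per_shard)] += 1
    return loads


def imbalance(loads):
    """Busiest shard divided by the average load; 1.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# 100 users, one check-in each, spread over 4 shards: perfectly even.
even = list(range(100))
print(shard_loads(even, 4, 25))             # [25, 25, 25, 25]

# Now users 0-9 (all on shard 0) each check in 10 more times.
skewed = even + [u for u in range(10) for _ in range(10)]
print(shard_loads(skewed, 4, 25))           # [125, 25, 25, 25]
print(imbalance(shard_loads(skewed, 4, 25)))  # 2.5
```

The point of the toy numbers: the user count per shard is identical in both cases, yet a handful of very active users is enough to put one shard at two and a half times the average load.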

Foursquare says they’re not exactly sure why introducing a new shard caused a total site failure — but it did. They then spent the next several hours still attempting to even out the data, but couldn’t figure out a good way to do that without keeping everything down. Eventually, they had to basically re-do the entire problematic shard, which took hours.

The good news is that despite all of this, Foursquare promises that no data was lost.

In the short term, Foursquare says the shard fix should keep this problem from recurring anytime soon. Longer term, they’re making bigger architecture changes meant to keep it from happening again at all. They’re also looking into better safeguards so that even if there is a problem, Foursquare can stay up during it.

Obviously, all of this sounds fairly reminiscent of the early days of Twitter, when the service was unable to reliably stay up. Foursquare hasn’t had issues on that level yet, but as they continue to grow, they could get there without major changes. So it’s good to see they’re thinking ahead on this stuff.

The team (which is now 32 people) also promises to communicate better in the future when downtime and/or errors occur. They’ve created a new Status blog just for that.

[photo: flickr/soupstance]

