As many of you know, a lot of the sites that use Rackspace as their hosting provider were down for about an hour yesterday because Rackspace itself went down. According to an incident report that we’ve obtained, the cause was a power outage at one of its data centers.
Rackspace has backup power systems in place, but a series of events apparently caused those backups to fail, taking the servers down. Here’s the key nugget:
The breaker on the primary utility feeder tripped, initiating a sequence of events that ultimately caused a power interruption in Phase I and Phase II of the data center. All systems initially came up on generator power without customer impact. The ‘A’ bank of generators, which support UPS clusters A and B in Phase I and UPS cluster E in Phase II, then experienced excitation failure which escalated to the point where the generators were no longer able to maintain the electrical load. Rackspace then attempted to switch to our secondary utility feeder, but was unable to do so due to an issue in the Pad Mounted Switch (PMS). At approximately 3:15pm CDT, power supply through UPS clusters A, B and E was lost when the batteries in those clusters discharged, and equipment receiving power through those clusters experienced an interruption in service.
The service says only one of its nine data centers was affected by this failure, but many high-profile sites went down as a result, including EventBrite, Justin Timberlake’s site and Michelle Malkin’s popular political blog. As Rackspace said yesterday, “We owe better, and will deliver.”
Below, find the full incident report.