A large number of customers of Rackspace Cloud, including Techcrunch, have been experiencing sporadic downtime for the past hour or so. The status blog reports that the service was degraded, and other reports state that it is due to a power outage at the Dallas network operations center. Customers of both Rackspace Cloud and Slicehost are affected, putting services such as Posterous, Dailybooth, tr.im and others out of commission.
I got the first alert as I was stepping towards the door to leave (it is always like that), and when I got back to my seat I found that half the web seemed to be talking about it. The main Techcrunch site was still serving pages to most visitors, thanks to our super-aggressive-mega-cache, but it seemed that the entire Dallas NOC was being rebooted.
From the status blog:
As of 12:35AM CST Rackspace Cloud engineers are seeing intermittent connectivity to our WC2 cluster in our Dallas – Fort Worth (DFW) and data center. We are working to resolve the issue as quickly as possible and will update the status post accordingly.
If you have any questions or concerns please contact our support via live chat or at 1-877-934-0407 international +1.210.581.040.
UPDATE: As of 1:15am CST, Rackspace Cloud engineers are still working to address the current connectivity issues. We are making significant progress and we will post another update here shortly.
UPDATE: As of 1:30am CST, service has been restored to the majority of our technology clusters in our WC2 cluster. Some sites may still be having performance issues, We are continuing to monitor and address the situation. Additional updates to follow.
From Slicehost (who actually mention a power outage):
November 3rd, 2009 @ 01:14 AM
UPDATE 1:16AM CDT: Power has been restored, however, we’re working to check all our systems and make sure everything comes back up correctly. Slices have not yet been restarted. We’ll try to keep you updated as much as possible.
We are currently experiencing a service interruption in our Dallas data center. Our engineers are currently working to restore connectivity. We will send an update as soon as information becomes available.
And from Scoble, on Twitter:
(the list he pointed to is actually a good one to follow if you are a Rackspace customer).
This will likely lead many to curse the cloud, when in essence nothing about this problem seems unique to the cloud. What is more concerning is that the NOC seems to have lost power (almost unimaginable) and then took so long to come back online.
So – how did you all spend the downtime? It seems most admins and devs from Rackspace-hosted companies were just hanging out on Hacker News and IRC bitching about RS :) (first time I noticed that he shares initials with his employer).
As soon as we know what happened, or learn anything more, we will post updates here.
Update from Rackspace, via their site:
Rackspace has experienced a service interruption during tonight’s scheduled maintenance on UPS Cluster G. We were testing phase rotation on a Power Distribution Unit (PDU) when a short occurred and caused us to lose the PDUs behind this Cluster. The phase rotation allows us to verify synchronization of power between primary and secondary sources.
All power has been restored and devices are being brought back online. The PDUs were down for a total of about 5 minutes. We have aborted the maintenance for the remainder of the evening and will reschedule this for another date.
Service to Cloud sites has been restored and we are continuing to work with Cloud sites customers to bring them online.