One of the root causes? A bug in the Skype for Windows client (version 5.0.0152).
Rabbe kicks off by explaining that a cluster of support servers responsible for offline instant messaging became overloaded on Wednesday, December 22.
A number of Skype clients subsequently started receiving delayed responses from said overloaded servers, which weren’t properly processed by the Windows client in question. This ultimately caused the affected version to malfunction.
Initially, users of Skype’s newer and older Windows software, as well as those using the service on Mac, iPhone and their television sets, were unaffected.
Nevertheless, the whole system collapsed as the faulty version of the Windows client, 5.0.0152, is by far the most popular – Rabbe says 50% of all Skype users globally were running it, and the crashes caused approximately 40% of those clients to fail.
The clients included 25–30% of all publicly available supernodes, which also failed as a result of this issue.
From the blog post:
A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients and establishing connections between them by creating local clusters of several hundred peer nodes per each supernode.
Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes.
Rabbe goes on to explain that a lot of people who experienced crashing Windows clients started restarting the software, which caused a huge increase in the load on Skype’s P2P cloud network. He adds that traffic to the supernodes was about 100 times what would normally be expected at the time of day the failure occurred.
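To put the figures quoted above in perspective, here is a rough back-of-the-envelope sketch. It is not Skype's own model: the function names are hypothetical, and it assumes traffic is spread evenly over the surviving supernodes, which the blog post does not claim.

```python
def crashed_share(share_on_version: float, crash_rate: float) -> float:
    """Fraction of ALL clients that crashed, given the share of users on the
    faulty version and the fraction of those clients that failed."""
    return share_on_version * crash_rate


def load_multiplier(fraction_lost: float) -> float:
    """Per-supernode load factor if the same traffic is split evenly
    (an assumption) across the supernodes that remain."""
    if not 0 <= fraction_lost < 1:
        raise ValueError("fraction_lost must be in [0, 1)")
    return 1 / (1 - fraction_lost)


# 50% of users on the faulty version, about 40% of those clients failing
print(f"Clients crashed: {crashed_share(0.50, 0.40):.0%}")  # Clients crashed: 20%

# 25-30% fewer supernodes than normal
for lost in (0.25, 0.30):
    print(f"{lost:.0%} lost -> {load_multiplier(lost):.2f}x load per supernode")
# 25% lost -> 1.33x load per supernode
# 30% lost -> 1.43x load per supernode
```

Under these assumptions the missing supernodes alone raise the per-node load by only a third or so. The reported surge of about 100 times normal traffic from restarting clients comes on top of that, which is why the remaining supernodes were overwhelmed.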
A perfect storm in the P2P clouds, so to speak.
To learn how Skype supported the recovery of its supernode network, and what it will be doing to prevent this from happening again, I suggest you go read the full blog post.
And major kudos to the company for being so forthcoming in explaining what happened.