A memory leak and a failed monitoring system caused the Amazon Web Services outage on Monday that took out Reddit and other major services.
According to a post Friday night, AWS explained that the problem arose after a simple replacement of a data collection server. After installation, the server did not propagate its DNS address correctly and so a fraction of servers did not get the message. Those servers kept trying to reach the server, which led to a memory leak that then went out of control due to the failure of an internal monitoring alarm. Eventually the system ground to a virtual stop and millions of customers felt the pain.
By Monday morning, the rate of memory loss became quite high and consumed enough memory on the affected storage servers that they were unable to keep up with normal request handling processes.
The failure in its North Virginia region eventually interrupted Reddit, Foursquare, Minecraft, Heroku, GitHub, imgur, Pocket, HipChat, Coursera and a number of others.
In the past, Amazon’s Elastic Block Storage (EBS) servers have proved troublesome. This outage proved not much different. The EBS servers, feeling the memory leak, began losing the ability to process customer requests, causing the number of stuck volumes to increase quickly. The server degradation came all at once, causing a tax on the system as not enough healthy servers could be found to replace them all.
The outage started at 10 a.m. PST. Five hours later, AWS discovered the root of the problem. An hour later things got back to normal.
AWS says it is taking a number of steps to prevent similar issues going forward. The group plans to deploy monitoring that will sound the alarm if this specific memory leak problem arises again in any of its production EBS servers. Next week it will begin deploying a fix for the memory leak issue.
AWS has had its share of outages over the past several months. Its problems are magnified by an increasingly competitive market that is seeking to slow AWS’ momentum by casting doubt on its infrastructure.
I get the competitive issues in play here. But customers should not overlook AWS’ uniqueness in providing a service that allows startups to use elastic computing, network and storage to compete on the world stage. It may have outages, but no other service comes even close to what AWS offers its customers.