It’s a timely move, as AWS has had its fair share of outages. With tools like Chaos Monkey, companies can be better prepared when their cloud infrastructure fails.
In a blog post, Netflix says that this is the first of several tools that it will open source to help companies better manage the services they run in cloud infrastructures. Next up is likely to be Janitor Monkey, which helps keep an environment tidy and costs down.
Chaos Monkey has achieved its own fame for its innovative approach. According to Netflix, the tool “randomly disables production instances to make sure it can survive common types of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.”
Netflix unleashes Chaos Monkey in the middle of a business day, in a monitored environment with engineers standing by to address any problems. The exercise exposes weaknesses in the system and teaches the team how to fix them. With that knowledge, Netflix builds automatic recovery mechanisms to deal with the vulnerabilities.
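The idea can be illustrated with a toy simulation. This is only a sketch of the concept, not Netflix's actual code: the function names, the in-memory `instances` dictionary, and the 9-to-3 weekday window are all assumptions made for the example, and nothing here talks to a real cloud API.

```python
import random
from datetime import datetime


def within_business_hours(now, start_hour=9, end_hour=15):
    """Return True on a weekday inside the (hypothetical) attended window."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, rng):
    """Choose one running instance at random, or None if nothing is running."""
    running = sorted(name for name, state in instances.items() if state == "running")
    return rng.choice(running) if running else None


def run_chaos_round(instances, now, rng):
    """Mark one random instance as terminated, but only while engineers are on hand."""
    if not within_business_hours(now):
        return None
    victim = pick_victim(instances, rng)
    if victim is not None:
        instances[victim] = "terminated"
    return victim


if __name__ == "__main__":
    fleet = {"web-1": "running", "web-2": "running", "web-3": "running"}
    killed = run_chaos_round(fleet, datetime(2012, 7, 30, 11, 0), random.Random())
    print("terminated:", killed)
    print("fleet state:", fleet)
```

Restricting the round to attended hours mirrors the point above: failures are induced when people are watching, so that an unplanned failure in the middle of the night has already been rehearsed.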
The goal is to have the system so resilient that a failure at 3 am on a Sunday will not even be noticed.
Instance failure is common in the cloud. Even if you are confident your architecture is solid, there is no foolproof protection against the host of issues that can arise. Problems may strike next week or next month. Even a simple fix can be disastrous at times, causing problems you would never expect.
From the Netflix blog:
Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer “quick patched” an instance last week and forgot to commit the changes to your source repository?
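The first of those questions, whether a load balancer routes around instances that go offline, can be made concrete with a small sketch. The `RoundRobinBalancer` class below is hypothetical and purely illustrative; it assumes some external health check calls `mark()` when a backend goes down or comes back, and it is not the behavior of any particular load balancer product.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = {b: True for b in self.backends}
        self._next = 0  # index of the next backend to try

    def mark(self, backend, is_healthy):
        """Record the result of a (hypothetical) health check."""
        self.healthy[backend] = is_healthy

    def route(self):
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if self.healthy[backend]:
                return backend
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    lb.mark("web-2", False)  # simulate an instance going offline
    print([lb.route() for _ in range(4)])
```

A chaos-style test would take an instance offline and then check that requests keep landing only on the survivors, which is what the skip in `route()` is meant to guarantee.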