Netflix Open Sources Chaos Monkey – A Tool Designed To Cause Failure So You Can Make A Stronger Cloud

Netflix has open sourced “Chaos Monkey,” a tool designed to purposely cause failures in order to increase the resiliency of applications running in Amazon Web Services (AWS).

It’s a timely move, as AWS has had its fair share of outages. With tools like Chaos Monkey, companies can be better prepared for the moment their cloud infrastructure fails.

In a blog post, Netflix says that this is the first of several tools it will open source to help companies better manage the services they run in cloud infrastructures. Next up is likely to be Janitor Monkey, which helps keep an environment tidy and costs down.

Chaos Monkey has achieved its own fame for its innovative approach. According to Netflix, the tool “randomly disables production instances to make sure it can survive common types of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.”

Netflix unleashes Chaos Monkey in the middle of a business day, with engineers standing by to monitor the system and address any problems. The exercise gives them an understanding of the system’s weaknesses and lessons for fixing them. With that knowledge, Netflix builds automatic recovery mechanisms to deal with the vulnerabilities.
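The core idea, picking one production instance at random and killing it, is simple enough to sketch in a few lines. This is an illustrative toy, not Netflix’s actual code: the `pick_victim` helper and the instance IDs are invented for the example, and a real run would call the cloud provider’s terminate API rather than printing.

```python
import random

def pick_victim(instance_ids, rng=None):
    """Randomly select one instance to terminate, Chaos Monkey style."""
    if not instance_ids:
        return None  # empty group: nothing to disable
    rng = rng or random.Random()
    return rng.choice(instance_ids)

# A real implementation would terminate the chosen instance through the
# cloud provider's API (e.g. EC2's TerminateInstances call); here we only
# report which instance would have been killed.
group = ["i-0a1b", "i-0c2d", "i-0e3f"]
print("terminating:", pick_victim(group, rng=random.Random(7)))
```

Running a selector like this on a schedule, only during business hours when engineers are watching, is what turns random failure into a controlled experiment.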

The goal is to have the system so resilient that a failure at 3 am on a Sunday will not even be noticed.

Instance failure is common in the cloud. Even if you are confident your architecture is solid, there is no fool-proof protection against the host of issues that can arise. Problems may strike next week or next month, and even a simple fix can be disastrous, causing problems you would never expect.

From the Netflix blog:

Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer “quick patched” an instance last week and forgot to commit the changes to your source repository?

Managing apps in the cloud can be complex. When AWS goes down, there will always be cries about the cloud and its problems. People need to realize that the cloud is like a giant programmable environment: it does fail, but tools like Chaos Monkey can help customers prepare for the issues that will inevitably arise.