The rise of chaos engineering

How do you build reliable software? It is a question that has been at the top of my mind for the past few weeks, as I seem to be increasingly confronted by software that just doesn’t work anymore. Bugs, crashes, errors, data leaks: they are so common in our everyday lives that they can seem completely unremarkable.

The existing tools (unit tests and application performance monitoring, among many others) are useful to a degree, but they are clearly not a panacea. In response, a movement is growing around a new field known as “chaos engineering,” which is designed to dramatically increase the quality and reliability of delivered services.

Last week, I had a conversation with one of the evangelists of the movement, Kolton Andrus. Andrus is the founder and CEO of a startup called Gremlin, which is building chaos engineering as a service. Formerly, he spent years working at Amazon and Netflix, where he introduced what have since been dubbed chaos engineering principles to the software teams there.

The methodology of chaos engineering is simple in concept, but hard in execution. Software systems today are complex and tightly coupled, meaning that rendering a single webpage may actually rely on hundreds of database, file, image, and other requests. There has been a “combinatorial explosion,” according to Andrus, particularly for engineering teams that have chosen a microservices architecture.

Chaos engineering takes the complexity of that system as a given and tests it holistically by simulating extreme, turbulent, or novel conditions and observing how the system responds and performs. What happens if a disk server suddenly goes down, or if network traffic suddenly spikes because of a DDoS attack? What happens if both happen at the same time? Once an engineering team has that data, it can use the feedback to redesign the system to be more resilient.
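To make the idea of simulated failure concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how Netflix or Gremlin actually inject faults: `read_block` is a hypothetical stand-in for a request to a storage server, and `with_faults`, `InjectedFault`, and `run_experiment` are names invented for this example.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real outage during an experiment."""


def with_faults(call, failure_rate, rng):
    """Wrap `call` so that a fraction of invocations fail on purpose."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return call(*args, **kwargs)
    return wrapped


def read_block(block_id):
    # Hypothetical stand-in for a request to a storage server.
    return f"data-{block_id}"


def run_experiment(requests, failure_rate, seed=0):
    """Send `requests` calls through the faulty wrapper and count outcomes."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_read = with_faults(read_block, failure_rate, rng)
    outcomes = {"ok": 0, "failed": 0}
    for block_id in range(requests):
        try:
            flaky_read(block_id)
            outcomes["ok"] += 1
        except InjectedFault:
            outcomes["failed"] += 1
    return outcomes
```

Calling `run_experiment(1000, 0.2)` gives a rough picture of how many requests a caller would have to absorb if about a fifth of storage reads failed. A real experiment would observe the whole system's behavior rather than a single counter.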

Andrus offered the example of an info page for a Netflix video. If the video streamer is down, then the movie shouldn’t be accessible. However, if the database for the reviews data isn’t available, a user should still be able to watch the video (maybe they know exactly what they are looking for). By identifying what components of a page can degrade without affecting the user, Netflix can increase the reliability of its systems.
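The info-page example can be sketched in a few lines of Python. This is a hypothetical illustration, not Netflix's code: the function names, the `streamer_up` and `reviews_up` switches, and the placeholder URL are all assumptions made for the example. The point it shows is that a failure in a critical dependency is allowed to propagate, while a failure in an optional one is caught and the page degrades.

```python
class ServiceUnavailable(Exception):
    """Raised when a (hypothetical) backend dependency is down."""


def fetch_stream_url(video_id, streamer_up=True):
    # Critical dependency: without it there is nothing to watch.
    if not streamer_up:
        raise ServiceUnavailable("video streamer is down")
    return f"https://stream.example.invalid/{video_id}"


def fetch_reviews(video_id, reviews_up=True):
    # Optional dependency: nice to have, not required for playback.
    if not reviews_up:
        raise ServiceUnavailable("reviews database is down")
    return ["Great movie", "Too long"]


def build_info_page(video_id, streamer_up=True, reviews_up=True):
    """Assemble the page, degrading gracefully when reviews are missing."""
    # Let a streamer failure propagate: the page is useless without video.
    stream_url = fetch_stream_url(video_id, streamer_up)
    try:
        reviews = fetch_reviews(video_id, reviews_up)
    except ServiceUnavailable:
        reviews = None  # render the page without the reviews section
    return {"video_id": video_id, "stream_url": stream_url, "reviews": reviews}
```

With `reviews_up=False`, `build_info_page` still returns a playable page whose `reviews` field is `None`. With `streamer_up=False`, it raises `ServiceUnavailable`, because there is nothing useful left to show.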

Chaos engineering is pretty simple, and fun too. Break things, break them all the time, and keep breaking things until they work again, and keep working. The challenge, though, is how to carefully break things in a way that doesn’t degrade actual performance for a running web application. Netflix, for instance, doesn’t want millions of users to go without video streaming just because a couple of chaos engineers are testing whether their data center can survive a power outage.

That’s where Gremlin’s “resiliency as a service” comes in (I prefer “failure as a service,” but Andrus told me that is hard to sell to enterprises. Go figure). Using Gremlin, chaos engineers can set up different scenarios, run simulations of those scenarios, and, most importantly, quickly revert a scenario if a system is degrading worse than expected. The idea is to offer exact control over every step of the simulation.
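The set-up, run, and revert loop might look something like the following Python sketch. To be clear, this is not Gremlin's API, which the conversation did not cover. `run_scenario` and its parameters are hypothetical names, and the error-rate readings are made-up numbers used only to show the scenario being cut short.

```python
def run_scenario(inject, revert, measure_error_rate, max_error_rate, checks):
    """Inject a fault, watch a health metric, and always revert.

    Returns "completed" if every check stayed within the limit, or
    "aborted" if the scenario was halted early.
    """
    inject()
    try:
        for _ in range(checks):
            if measure_error_rate() > max_error_rate:
                return "aborted"  # degrading worse than expected
        return "completed"
    finally:
        revert()  # runs on normal completion, early abort, or exception


# Hypothetical usage: the third reading breaches the 10% limit.
state = {"fault_on": False}
readings = iter([0.01, 0.02, 0.30, 0.05])
result = run_scenario(
    inject=lambda: state.update(fault_on=True),
    revert=lambda: state.update(fault_on=False),
    measure_error_rate=lambda: next(readings),
    max_error_rate=0.10,
    checks=4,
)
```

Here `result` is `"aborted"` and `state["fault_on"]` is back to `False`. Putting the revert in a `finally` clause is one simple way to make sure the fault is removed no matter how the experiment ends.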

Chaos engineering isn’t a replacement for traditional software reliability techniques. For instance, one popular technique for improving software reliability is the use of “unit tests.” The idea is to write a small test that checks that a very specific section of code is working properly. A developer might check that a valid login actually logs in a user, or that a certain data response to a request is formatted properly. By writing tests constantly as new features are added, software engineers can quickly identify if new code breaks existing functionality.
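For readers who haven't seen one, a unit test for the login case could look roughly like this, using Python's standard `unittest` module. The `login` function and the `USERS` table are hypothetical stand-ins invented for the example. A real application would check hashed passwords against a user store.

```python
import unittest

# Hypothetical credential store, for illustration only.
USERS = {"alice": "correct-horse"}


def login(username, password):
    """Return True only when the user exists and the password matches."""
    expected = USERS.get(username)
    return expected is not None and expected == password


class LoginTests(unittest.TestCase):
    def test_valid_login_succeeds(self):
        self.assertTrue(login("alice", "correct-horse"))

    def test_wrong_password_is_rejected(self):
        self.assertFalse(login("alice", "wrong"))

    def test_unknown_user_is_rejected(self):
        self.assertFalse(login("mallory", "correct-horse"))
```

Saved to a file, these tests can be run with `python -m unittest`. Each one exercises a single narrow behavior, which is exactly the “micro” focus that makes unit tests useful and also limits what they can catch.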

Yet, there are limits to the value of unit tests. First, they are only as good as the developer who writes them, and they cannot check for things that a developer hasn’t thought of and actually coded into the system. Furthermore, unit tests are designed to test small kernels of code, but what happens when all of that code interacts in a system? If unit tests are focused on the “micro,” then chaos engineering is the complementary test focused on the “macro.”

I asked Andrus what the biggest challenges are for growing the chaos engineering movement. In his mind, the challenge is often deeply cultural in the engineering team. Some teams don’t have the flexibility to simulate disasters on a system because real-life disasters are happening so rapidly that a team might be spending all of its time triaging rather than trying to get ahead of the situation.

Testing can also be very political. Finding the points of failure in a system might force deep conversations about a particular software architecture and its robustness in the face of tough situations. A particular company might be deeply invested in a specific technical roadmap (e.g. microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.

There is now a chaos engineering page, a community, and meetups around the world. Andrus told me that the first people to understand the challenges are often the engineers who have had to respond to crises on a Friday night and don’t want to be staring at their pagers waiting in anticipation for something to fail.

Gremlin recently launched publicly, and the company raised $7.5 million in a Series A in December.

Modern society is failing us, but not only because of mistakes. Increasingly, engineering orgs are building tests that fail on purpose in order to build better reliability into systems. With any luck, a bit more chaos might just lead to more stability in our software.