Move slow and break nothing

Facebook Messenger was down for me for about an hour earlier this week. My MacBook Pro randomly kernel panics overnight and restarts. Slack was down, and GitHub, and AWS. A little more than a year ago, Dyn went down, throwing the DNS layer of the internet into a tailspin. Practically every chip made by Intel has serious security flaws. Equifax leaked 143 million accounts. Tokyo-based Coincheck lost over $400 million in tokens due to hackers.

If software is eating the world, then that might explain why everything seems so ridiculously broken these days.

It’s easy to blame companies, or hackers, or software engineers, and it’s just as easy to give up, believe that nothing is going to get better, and revert to a pre-agrarian society. What we have is a real crisis in reliability, not just across software, but across our entire society. Even the U.S. government had some serious downtime this week.

What’s going on is that we have greatly increased the complexity of our society’s systems, even as we couple them more tightly together. Charles Perrow, a sociology professor at Yale, called the failures that result from this combination “normal accidents” in his book of the same name. It’s an oxymoronic term for a very intelligent observation: that what we think of as “accidents” or crashes or bugs are really quite common and, indeed, inevitable, given the design of the systems that we rely on.

Complex systems are ones in which changes, even small ones, can have disproportionate effects on the outcome. Take last year’s downtime of S3, Amazon’s storage layer. According to Amazon’s after-action report: “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Amazon has fixed the issue and put in new safeguards to make sure such a change can’t happen again. That’s fantastic, and Amazon should be lauded for writing up and disclosing a comprehensive report on the error. But this was a “normal accident” — the sheer complexity of Amazon’s services means that the surface area of things that can go wrong is practically infinite.

On top of complexity, tight coupling means that parts of a system that could otherwise fail independently are designed to work closely together, so a failure in one spreads to the others. When S3 went down, it knocked out a bunch of major websites, because those sites had no backup or redundancy in the event that Amazon’s services were not working. That is, except for Netflix, which had developed redundancies in its infrastructure to ensure that the failure of any individual component would not bring down the entire system.

Everything about our modern world has increased both the complexity of our systems and how tightly they are coupled. Take software development itself. The (usually) clearly designed APIs and libraries of the host operating system have been replaced by a ghastly and constantly evolving collection of libraries and web frameworks, a palimpsest of code and hope.

Even the supposedly stable parts of the stack can be our undoing. The Heartbleed bug in the OpenSSL library was a gaping hole in every single secure transaction that happened on the web. It just so happened that at the time, there was a single maintainer working full-time on the project, and that the OpenSSL Software Foundation had annual donations of $2,000.

This starts to get at the rot that is happening. Everything requires maintenance, practically all the time. It doesn’t have to be millions of man-hours, but it is certainly not going to be zero. Yet software libraries are abandoned all the time. Many popular libraries are down to a single maintainer, who keeps the library alive but can hardly be expected to guarantee its performance.

And yet, we laud innovators who build new software libraries even while the edifice of our progress disintegrates. Academics are increasingly studying this problem of how society views maintainers (hint: not well). Ultimately though, we are all responsible for these outcomes, and we all need to take the opportunity to reduce complexity and increase reliability for any system we are a part of, whether software or not.

Can we do sensitivity analysis on each component of the system to ask what would happen if one component, or a combination of components, were to fail? Can we run simulations to prepare everyone from software engineers to CEOs to handle a data breach, or a database failure, or a power outage? Can we build up more resilience by ensuring that there are carefully designed redundancies in our most critical systems?
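To make the first of those questions concrete, here is a minimal sketch of what a failure-impact analysis might look like. Everything in it is an assumption for illustration: the service names, the dependency map, and the rule that a service stays up as long as at least one provider in each of its dependency groups is up. It is a toy model, not a description of how any real company does this.

```python
from itertools import combinations

# Hypothetical dependency map (illustrative only). Each service lists groups
# of providers; it stays up if at least one provider in EVERY group is up.
# A group with two entries therefore models a redundancy.
DEPENDS_ON = {
    "storage": [],
    "backup_storage": [],
    "dns": [],
    "website_a": [["storage"], ["dns"]],                    # no redundancy
    "website_b": [["storage", "backup_storage"], ["dns"]],  # redundant storage
}


def surviving(failed):
    """Return the set of services still up after `failed` components go down."""
    up = set(DEPENDS_ON) - set(failed)
    changed = True
    while changed:  # keep propagating until no more services fall over
        changed = False
        for service in sorted(up):
            groups = DEPENDS_ON[service]
            # A service fails if any one of its groups has no live provider.
            if any(not (set(group) & up) for group in groups):
                up.discard(service)
                changed = True
    return up


def impact_report(max_failures=1):
    """Map each combination of failed components to the services it takes down."""
    report = {}
    for k in range(1, max_failures + 1):
        for failed in combinations(sorted(DEPENDS_ON), k):
            down = set(DEPENDS_ON) - surviving(failed)
            report[failed] = sorted(down - set(failed))
    return report


if __name__ == "__main__":
    for failed, collateral in impact_report(max_failures=2).items():
        print(f"{' + '.join(failed)} fails -> also down: {collateral or 'nothing'}")
```

In this toy model, losing "storage" alone takes down only the site without a backup, while losing both storage providers takes down both sites. That is the kind of answer the exercise is meant to surface before an outage does it for you.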

Reliable systems do exist. The United States has not had a fatality from a commercial airplane crash since 2013, and on a domestic carrier since 2009. There were 823 million passengers in the United States in 2016 alone. Clearly, highly redundant and reliable systems can be built if we want them and are willing to build the right culture of maintenance, safety, and reliability.

For everyone, but particularly software engineers: let’s get back to basics. It’s better to have fewer features that work reliably than more features that break every other day. Let’s move slow and break nothing. Reliability and resilience may just be the next major wave of technology we are waiting for, and those of us who rely on our systems day in and day out are ready for this stable world.