How you react when your systems fail may define your business

At around 9:45 a.m. Pacific Time on February 28, 2017, websites including Slack, Business Insider, Quora and other well-known destinations became inaccessible. For millions of people, the internet itself seemed broken.

It turned out that Amazon Web Services was suffering a massive outage involving S3 storage in its Northern Virginia data center, a problem that cascaded across dependent services and took four agonizing hours to resolve.

Amazon eventually figured it out, but you can only imagine how stressful it must have been for the technical teams who spent hours tracking down the cause of the outage so they could restore service. A few days later, the company issued a public post-mortem explaining what went wrong and what steps it had taken to make sure that particular problem didn’t happen again. Most companies try to anticipate these types of situations and take steps to keep them from ever happening. In fact, Netflix came up with the notion of chaos engineering, in which failures are deliberately injected into systems to expose weaknesses before they turn into outages.
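
To make the idea concrete, here is a minimal, purely illustrative sketch of chaos-style fault injection in Python. It is not Netflix’s actual tooling, and the names (with_chaos, fetch_profile) are hypothetical; the wrapper randomly injects errors and latency into a service call so a team can check that retries and timeouts hold up before a real outage tests them.

    import random
    import time

    # Purely illustrative chaos-style fault injection; not any vendor's actual tool.
    def with_chaos(call, failure_rate=0.2, max_delay_s=0.1):
        """Wrap a service call, randomly injecting errors and latency."""
        def wrapped(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("injected failure")   # simulate a downstream outage
            time.sleep(random.uniform(0, max_delay_s))      # simulate network latency
            return call(*args, **kwargs)
        return wrapped

    def fetch_profile(user_id):
        # Hypothetical downstream call standing in for a real service.
        return {"user_id": user_id, "name": "example"}

    chaotic_fetch = with_chaos(fetch_profile)

    # Exercise the wrapped call and count how often injected faults surface.
    failures = 0
    for i in range(20):
        try:
            chaotic_fetch(i)
        except ConnectionError:
            failures += 1
    print(f"{failures} of 20 calls failed under injected chaos")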

Unfortunately, no tool can anticipate every outcome.

It’s highly likely that your company will encounter a problem of immense proportions like the one that Amazon faced in 2017. It’s what every startup founder and Fortune 500 CEO worries about — or at least they should. What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you learn.

We spoke to a group of highly trained disaster experts to learn more about preventing these types of moments from having a profoundly negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running.

Tammy Butow, principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy. If the site is up and running, that’s generally the key to happiness. “SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says.

Companies measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number. According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and your business’s bottom line.
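
For context, those nines translate into a strikingly small allowance for downtime. The back-of-the-envelope arithmetic below (a minimal Python sketch, not tied to any particular monitoring tool or to how any given company actually measures availability) shows just how little room they leave in a year:

    # Rough yearly downtime budgets implied by common availability targets.
    # Illustrative only; real accounting depends on how availability is measured.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{label} ({availability:.3%}): ~{downtime:.1f} minutes of downtime per year")

Five nines works out to roughly five minutes of downtime a year; as Jones points out, a number like that captures nothing about how customers actually experience an incident.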

“It’s money at the end of the day, but also over time, user sentiment can change [if your site is having issues],” she says. “How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their family members. The nines don’t capture any of that.”

Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to rethink the idea of the nines. “Maybe we need to change that term. Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our products.”

When things go wrong

Companies go to great lengths to prevent disasters and avoid disappointing their customers, and they usually have contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control. When that happens, SREs need to execute, and that takes planning too: knowing what to do when the going gets tough.

Someone needs to be calm and just keep asking the right questions, says Butow, because the initial reaction is probably going to be a bit of panic, no matter how well prepared you think you are. During one actual disaster she was involved in, a key database went down, leaving her company dead in the water on a Friday afternoon. Her response was to start asking questions.

“I was just trying to keep everyone calm and focused on restoring service. That’s what you want to do when something’s wrong,” she says today. “You need to get up and running safely. So I was just asking questions like, ‘What’s your backup strategy?’ ‘Can we recover?’ ‘How are we going to recover?’ Let’s try and talk through the different solutions.”

Leslie Carr, senior director of engineering at Quip, says it’s good to have a plan for the roles people will play in such a disaster. She recommends putting one person in charge of communications to the rest of the company, while one or more people deal directly with the technical problem at hand. When a person is totally focused on solving the problem, they probably aren’t going to take the time to communicate (and you probably don’t want them to), she says.

You can’t automate your way out of this

Some may think adding an automation layer to help monitor and fix these incidents automatically is a solution, but Ryan Kitchens, senior SRE at Netflix, says the complexity of these problems makes them difficult to automate, and automation itself can lead to its own set of issues. He says a 1983 paper by Lisanne Bainbridge, “Ironies of Automation,” helped define his thinking on the subject.

“The more automation you add in, you actually introduce new problems that you have to account for, particularly with the role of AI and machine learning,” he says.

If these were repeatable occurrences, says Ross, AI could help, but by their nature, these kinds of incidents tend to be unique.

“One of the things I think about with AI and incidents, and AI kind of responding the way that a human would, is that would indicate to me that you’re having a lot of the same incidents. One of the things about incident responses is that you’re trying to not have the same thing happen over and over and over again, and AI is specifically tuned to learn heuristics about one thing,” says Ross.

Digging for answers

As Ross pointed out, these kinds of disasters are so hard to plan for because each one tends to be a unique set of circumstances. That’s why Butow recommends taking screenshots in real time. That set of snapshots can act as a guide when you go back later to figure out what went wrong.

Jessica DeVita, senior resilience engineering advocate at Netflix (previously at Microsoft), says it’s critical not to blame an individual; even if one person precipitated the event, the underlying problem is going to be systemic.

“We latch onto something and then socially construct a cause. These kinds of issues have paths going back years,” says DeVita. “How was it that the database was able to be dropped? You have to look at the systemic factors. It is not just that one guy made a mistake.”

Carr agrees, noting that when you get past the blame game and dig deep, you’ll usually find that there are much bigger issues at play, which makes blaming the individual an easy way out. “While no one else might be blaming this individual, they’re often blaming themselves, and then the rest of the organization is kind of okay with that, which is a problem in itself, as well,” she says. When you investigate these kinds of complex problems, she adds, you find that, as DeVita said, it’s rarely that simple.

Interestingly enough, the Amazon post-mortem from 2017 does point to an individual triggering the problem, but the system had been designed in a way that allowed it to happen. Based on that, Amazon recalibrated its systems so that this type of mistake couldn’t happen again, at least not in the same way. As DeVita and Carr point out, the real issue wasn’t the individual’s mistake; it was how the system had developed over time and allowed that mistake to have the impact it did.

There are lots of tools, like chaos engineering, designed to help companies test for possible issues before they happen, but outages still occur because there is a complex mix of humans and machines involved. Often, these systems have been built by a variety of people over a number of years, layer upon layer.

As a result, it’s impossible to know which lever will cause a disaster or when, no matter how much testing you do. The best a company can do is be prepared, and when disaster strikes, stay calm, get systems back up and running, then figure out why it happened without falling prey to the blame game.

Remember: regardless of how technical the issue might be, you still need to keep your customers satisfied.