What your company can learn from the Bank of England’s resilience proposal

Kolton Andrus

Contributor

Kolton is co-founder and CEO of Gremlin, the chaos engineering company helping the world build a more reliable internet.

The outages at RBS, TSB and Visa left millions of people unable to deposit their paychecks, pay their bills, acquire new loans and more. As a result, the House of Commons’ Treasury Select Committee (TSC) began an investigation of the U.K. finance industry and found the “current level of financial services IT failures is unacceptable.” Following this, the Bank of England (BoE), Prudential Regulation Authority (PRA) and Financial Conduct Authority (FCA) decided to take action and set a standard for operational resiliency.

While policies can often feel burdensome and detached from reality, these guidelines are reasonable steps that any company across any industry can exercise to improve the resilience of their software systems.

The BoE standard breaks down into these five steps:

  1. Identify critical business services based on those that end users rely on most.
  2. Set a tolerance level for the amount of outage time during an incident that is acceptable for that service, based on what utility the service provides.
  3. Test if the firm is able to stay within that acceptable period of time during real-life scenarios.
  4. Involve management in the reporting and sign-off of these thresholds and tests.
  5. Take action to improve resiliency against the different scenarios where feasible.

Following this process aligns with best practices in architecting resilient systems. Let’s break each of these steps down and discuss how chaos engineering can help.

Identify critical business services

The operational resilience framework recommends focusing on the services that serve external customers. Internal applications are important for productivity, but this customer-first mentality is sound advice for determining a starting place for reliability efforts. It’s ultimately up to the business to weigh the criticality of the different services it offers, but the ones customers need to make payments, retrieve payments, invest or insure against risks are all recommended priorities.

For example, retail companies can prioritize the customer’s critical shopping and checkout path as a place to start. Business SaaS companies can start with their customer-facing applications, especially those with Service Level Agreements (SLAs). To pick a simple example, Salesforce would focus reliability efforts on their CRM, not on their internal ticketing systems.

The second part of this stage is mapping, where firms “identify and document the necessary people, processes, technology, facilities and information (referred to as resources) required to deliver each of its important business services.” The valuable insight here is that mapping an application doesn’t stop at the microservices that make up the application itself, and neither should the reporting and testing. Sometimes even a service that’s thought to be noncritical can take down critical services (because of an undiscovered bug, for example), so companies need to be aware of this unintentional tight coupling.

This is where chaos engineering comes into play. Firms can map critical dependencies by running network attacks to see which services make up the application and to surface any unknown or undocumented dependencies. They can then incrementally scale the testing to include multiple services.
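As a rough illustration of the mapping idea, here is a minimal Python sketch. It is a toy model, not a real fault injector: the `CALLS` graph, the service names and both helper functions are hypothetical. The graph stands in for the live system, and `impacted_by` plays the role of a network attack that cuts off one service and reports everything that degrades as a result.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls directly.
# In a real experiment this is what the attacks would help you discover.
CALLS = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check"],
    "inventory": [],
    "fraud-check": ["geo-lookup"],
    "geo-lookup": [],
    "ticketing": [],
}


def impacted_by(failed, calls):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the call graph so we can walk from a dependency to its callers.
    callers = {name: [] for name in calls}
    for service, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


def undocumented_dependencies(critical, documented, calls):
    """List services whose failure reaches `critical` but are missing from the documented map."""
    return sorted(
        service
        for service in calls
        if service != critical
        and service not in documented
        and critical in impacted_by(service, calls)
    )


# The team has only documented checkout's direct dependencies.
print(undocumented_dependencies("checkout", {"payments", "inventory"}, CALLS))
```

In this made-up example the documented map lists only the two services that checkout calls directly, so the sketch flags the two deeper services that can still take checkout down.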

Set tolerance levels

The next step is setting the tolerance levels, also known as Service Level Objectives (SLOs), customized to the criticality of the service. In other words, banks must set and agree in advance on how long it may take to restore service during any named incident. The paper suggests several Service Level Indicators (SLIs) to track, such as outage time and the number of failed requests, but recommends outage time as the primary metric. The initial paper also recommended self-assessment to determine the acceptable outage time, but the three governing bodies have since taken a stronger stance. The new standard is a maximum of two consecutive days of downtime, which in my view is still too low a bar.

The paper also talks about the importance of setting what it calls an “impact tolerance,” which tries to account for all scenarios that could happen and then sets agreed-upon timeframes for remediation. Impact tolerance can be considered a Recovery Time Objective (RTO) metric, and is slightly stricter than a time-bound Service Level Objective (SLO). For example, region-loss scenarios have occurred but are quite rare (less frequent than once a year), yet the requirements state that a bank should be prepared to recover from such an incident in under two days.

However, in order to be realistic and limit the universe of possible failures, the paper suggests starting with the scenarios that have happened to the firm before, or to others in their industry. For example, if retailers see an outage at another retailer, it behooves them to replicate that incident in their environment to ensure resilience.

I believe this is the right approach: if you try to do too much too soon, it can quickly become overwhelming. I’d recommend running fire drills to see how fast systems self-heal, or to measure your team’s mean time to recover (MTTR) from known, likely incidents.
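To make the fire-drill idea concrete, here is a minimal Python sketch that computes MTTR from a handful of drills and flags any that exceed the impact tolerance. The two-day tolerance comes from the standard discussed above; the `Drill` class, the scenario names and every timestamp are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The two-day maximum discussed above; a firm would likely set tighter values per service.
IMPACT_TOLERANCE = timedelta(days=2)


@dataclass
class Drill:
    """One fire drill: when the failure was injected and when service was restored."""
    scenario: str
    started: datetime
    recovered: datetime

    @property
    def recovery_time(self):
        return self.recovered - self.started


def mean_time_to_recover(drills):
    """Average recovery time across all drills."""
    total = sum((d.recovery_time for d in drills), timedelta())
    return total / len(drills)


def breaches(drills, tolerance):
    """Scenarios whose recovery took longer than the agreed tolerance."""
    return [d.scenario for d in drills if d.recovery_time > tolerance]


# Made-up drill results, for illustration only.
DRILLS = [
    Drill("primary database failover", datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 30)),
    Drill("payment provider outage", datetime(2024, 2, 5, 14, 0), datetime(2024, 2, 5, 15, 30)),
    Drill("region loss", datetime(2024, 3, 1, 8, 0), datetime(2024, 3, 3, 20, 0)),
]

print("MTTR:", mean_time_to_recover(DRILLS))
print("Over tolerance:", breaches(DRILLS, IMPACT_TOLERANCE))
```

Note that the average alone can hide the one scenario that matters: in this invented data the MTTR sits well under two days even though the region-loss drill breaches the tolerance, which is why the sketch reports both.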

Test operational resilience

This is the step where chaos engineering is most directly applicable. The guidance on testing comes down to testing against severe but plausible failure scenarios seen by the firm or by others in the industry. The proposal recommends varying the severity of attacks by the number of resources affected or by the length of time the resources are unavailable. For example, if you are testing a database failover scenario, you can begin by adding a small amount of latency to the database, then drop the connection to the primary node, then lose multiple nodes, until you see how your system handles losing its connection to the entire database cluster.
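The escalation described above can be sketched in a few lines of Python. Everything here is hypothetical: `Stage`, `FakeCluster` and `run_escalating_experiment` are illustrative names, and the fake three-node cluster simply stays healthy while at least one node is reachable. A real experiment would inject faults with a chaos engineering tool and check real health metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the experiment, ordered from least to most severe."""
    name: str
    nodes_unreachable: int


class FakeCluster:
    """Toy stand-in for a database cluster; not a real fault injector."""

    def __init__(self, nodes=3):
        self.nodes = nodes
        self.unreachable = 0

    def inject(self, stage):
        self.unreachable = stage.nodes_unreachable

    def rollback(self):
        self.unreachable = 0

    def is_healthy(self):
        # Assumption for this sketch: one reachable node is enough to serve traffic.
        return self.unreachable < self.nodes


def run_escalating_experiment(stages, cluster):
    """Run stages in order; halt at the first one the system cannot absorb.

    Returns (names of stages that passed, name of the failing stage or None).
    """
    passed = []
    for stage in stages:
        cluster.inject(stage)
        healthy = cluster.is_healthy()
        cluster.rollback()  # always undo the fault before moving on
        if not healthy:
            return passed, stage.name
        passed.append(stage.name)
    return passed, None


STAGES = [
    Stage("add latency to database calls", 0),
    Stage("drop connection to primary node", 1),
    Stage("lose two nodes", 2),
    Stage("lose entire cluster", 3),
]

passed, failed_at = run_escalating_experiment(STAGES, FakeCluster(nodes=3))
print("Passed:", passed)
print("Failed at:", failed_at)
```

Halting at the first failing stage keeps the blast radius small: there is little value in losing the whole cluster on purpose if the system already struggles with a single node.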

Additionally, the paper recommends testing third-party resources. This is a best practice in the new API-driven economy, where modern applications are built by stitching together services from other teams inside or outside your organization. If an e-commerce store’s payment service is provided by a third party and goes down, the store should be prepared for the outage and have a failover solution in place. Chaos engineering lets you simulate these scenarios without actually having to bring down the service.
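Here is a minimal Python sketch of that payment example, under stated assumptions: `PaymentProvider`, `ProviderDown` and `charge_with_failover` are hypothetical names, and flipping the `available` flag stands in for a chaos experiment that blocks traffic to the third party. It does not reflect any real payment API.

```python
class ProviderDown(Exception):
    """Raised when a (simulated) third-party provider cannot be reached."""


class PaymentProvider:
    """Toy stand-in for a third-party payment client."""

    def __init__(self, name):
        self.name = name
        self.available = True  # set to False to simulate an outage

    def charge(self, amount_cents):
        if not self.available:
            raise ProviderDown(self.name)
        return f"{self.name}:charged:{amount_cents}"


def charge_with_failover(providers, amount_cents):
    """Try each provider in priority order and return the first successful charge."""
    unavailable = []
    for provider in providers:
        try:
            return provider.charge(amount_cents)
        except ProviderDown as exc:
            unavailable.append(str(exc))
    raise RuntimeError("all payment providers unavailable: " + ", ".join(unavailable))


primary = PaymentProvider("primary")
backup = PaymentProvider("backup")

# Simulate the third-party outage without touching the real service.
primary.available = False
print(charge_with_failover([primary, backup], 1999))
```

The point of the experiment is the last two lines: the outage is simulated on your side of the connection, so you learn whether the failover path works before the provider ever has a bad day.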

Management buy-in and continuous improvement

None of this should be performed in isolation. Setting the impact tolerance is a critical risk management process that should involve buy-in from senior management. The impact tolerance levels provide a clear metric that can be reported to senior management, so that they are included in the decision-making process when it comes to determining the criticality of needed improvements.

The proposal recommends firms take action by leveraging the findings of the testing stage and management’s cost-benefit analysis to prioritize areas for improvement. The recommendations stated include:

  • replacing outdated or weak infrastructure.
  • increasing system capacity.
  • achieving full fail-over capability.
  • addressing key person dependencies.
  • being able to communicate with all affected parties.

This rounds out the chaos experiment lifecycle as well. Once a system weakness is identified, companies should fix the bug based on severity, and then rerun the chaos experiments to ensure there isn’t a regression. Conducting post-mortems and publishing the results lets management and other teams learn from the findings and apply them to their own systems.

Conclusion

Following the process outlined by the BoE, PRA and FCA aligns with best practices for identifying and prioritizing areas for resiliency improvements. The process is currently opt-in beginning October 1, and while the standards only apply to U.K.-based institutions, many multinational firms are preparing to voluntarily adhere to them out of a strong customer focus. The process outlined parallels much of the tech industry’s best practices and could be considered astute guidance for any company wanting to build more resilient systems.
