Are site reliability engineers the next data scientists?

Donald Fischer

Contributor
Donald Fischer is a venture partner at General Catalyst.

It’s no secret that “data scientist” is one of the hottest job titles going. DJ Patil famously proclaimed data scientist “The Sexiest Job of the 21st Century” before moving on to join the White House as the first chief data scientist of the U.S. Once a rarefied in-house role at a few leading Internet companies such as LinkedIn and PayPal, data science has since grown into a global phenomenon, impacting organizations of all sizes across many industries.

More recently, a buzzy new job title has emerged from the same group of companies: that of site reliability engineer, or SRE. Will SREs follow the same path of rapid growth that data scientists did before them? Before we dive into that question, let’s consider the context that has led to the creation of site reliability engineering.

The new IT stack

Over the last 15 years, the largest Internet properties have quietly led a revolution in IT. The reason is simple: Traditional corporate data center techniques would not scale efficiently to the level required to run a global service like Google or Facebook. Instead, these companies have had to innovate at all layers of the technology stack, from hardware to networking to applications.

In many cases, the resulting building blocks have been released as open source software packages, or have inspired third parties to create their own versions. Now, organizations ranging from startups to the largest Fortune 500 enterprises are adopting these technologies for their own purposes.

Examples of this phenomenon are numerous. To pick just a few:

  • Containers. Google’s widespread internal adoption of lightweight OS containers inspired the rapidly growing movement around Docker, driving the company at the center of this phenomenon to $162 million in funding and prompting the creation of industry-wide collaborations like the Open Container Project.
  • Cluster management. Google’s internal Borg project similarly inspired two fast-growing open source communities around the Kubernetes and Mesos cluster resource management frameworks, setting the stage for efforts like the Cloud Native Computing Foundation.
  • Analytics. Google’s data processing innovations inspired Yahoo’s early investments into Hadoop, which has in turn spawned a whole ecosystem of modern big data technologies and commercial players, including Cloudera and Hortonworks.
  • Microservices. Amazon and Netflix were early innovators and evangelists in the practice of designing software applications as suites of independently deployable services, an approach that is also being widely adopted in industry in the form of products like the Reactive Platform from Lightbend (formerly Typesafe).

A unifying theme of these technologies is higher efficiency and lower cost at larger scale. But source code won’t solve these challenges in isolation. It must be complemented by new management techniques, methodologies and tools. In other words, the big picture needs to consider people and process as much as it does software.

The rise of site reliability engineering (SRE)

For inspiration on the people and process front, we can similarly look to the web-scale Internet companies. Many of the early innovators have rallied around the concept of site reliability engineering.

Ben Treynor, who joined Google as a site reliability tsar in 2003, has described SRE as “what happens when a software engineer is tasked with what used to be called operations.” Over the last decade, the team that Treynor started at Google has grown from a handful of production engineers to more than 1,000 SREs.

Moreover, the SRE concept has been embraced by other major Internet companies, including Dropbox, Airbnb, Netflix and many more. Job listings site Indeed now lists hundreds of SRE positions. The SRE community now even has its own conference, dubbed SREcon.

Andrew Widdowson, an SRE at Google, relates the discipline to competitive auto sports: “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”

As any competitive racing fan knows, a faster engine and chassis don’t mean much without a world-class pit crew, equipped with the right tools, techniques and strategies to keep the car in the lead. In Formula 1 racing, the days of winning races based on gut instinct are waning. Today’s winning teams are differentiated by real-time streaming data analytics as much as they are by pistons and tires.

SRE-in-a-box

It’s all well and good to be inspired by the large Internet companies, but how do we integrate the SRE discipline into existing enterprise IT teams?

Just as companies such as Cloudera packaged the early “tribal knowledge” around data engineering and turned it into turnkey products accessible to a mass IT audience, a new batch of companies is packaging the principles of SRE for the masses. The recently introduced Rocana Ops is an example. [Disclosure: I am an investor in Rocana.]

Rocana Ops gives administrators visibility into the inner workings of their data centers and applications. Just as a Bloomberg terminal enables brokers to monitor and investigate activity across markets, Rocana Ops uses big data techniques, combined with data visualization, to guide IT operators to the root cause of any issue in their complex IT infrastructure. Companies using Rocana Ops to power their IT operations gain the capabilities of the site reliability engineer discipline, without the steep learning curve.

A motivating example

Consider the example of a contemporary multi-channel e-commerce application. A typical modern system might be composed of core business logic implemented in Scala, linked to a legacy off-the-shelf Java order management system, backed by multiple transactional databases (say, both MongoDB and Oracle) and fronted by a Node.js API tier.

Some pieces of this puzzle may be deployed in an on-premises data center, while other components live on a public cloud provider like Amazon Web Services.

There will be dependencies on third-party services (perhaps Stripe for payments), and a mix of web endpoints and native mobile apps for Android and iOS interacting with the core system through an API gateway.

Now, consider a typical business-critical problem that could crop up: request timeouts are driving shopping cart abandonment by mobile app users. How long would it take to notice the problem to begin with? Once the problem is identified, given such a complex web of interacting technologies, where would one even start to look for the underlying root cause?

Is it a network issue, a database performance problem or an application error introduced in the most recent release?

With an SRE-inspired approach, system logs and telemetry are continuously collected, in real time, from all components of the system, and stored in a central data store. Machine-learning algorithms identify anomalous events (such as the rash of timeouts from mobile devices that represent a statistical outlier compared to historical patterns) and surface them to the attention of IT staff.
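To make the idea of a “statistical outlier compared to historical patterns” concrete, here is a minimal sketch in Python. It is not how Rocana Ops or any particular product works; it assumes the simplest possible technique (a z-score against a historical baseline), and the function name, the threshold and the per-minute timeout counts are all hypothetical. Production systems would likely use richer models that account for seasonality and trends.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that deviate from the
    historical baseline by more than `threshold` standard deviations.

    Illustrative only: a plain z-score test, with hypothetical names.
    """
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs >= 2 points
    if spread == 0:
        # A perfectly flat history: any different value counts as anomalous.
        return [(i, v) for i, v in enumerate(recent) if v != baseline]
    return [
        (i, v)
        for i, v in enumerate(recent)
        if abs(v - baseline) / spread > threshold
    ]


# Hypothetical per-minute counts of request timeouts from mobile clients.
history = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]
recent = [3, 2, 41, 38]

anomalies = find_anomalies(history, recent)
print(anomalies)  # [(2, 41), (3, 38)]
```

In this made-up data the last two minutes stand far outside the historical range, so they would be the events surfaced to IT staff.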

A rich web interface incorporating data visualizations guides the admin to the most relevant log events, highlighting other contemporaneous behavior changes observed across all elements of the IT infrastructure, wherever they reside.

Armed with the ability to quickly narrow in on the relevant data, the admin can identify the underlying problem.

Adapting to the new normal

The new stack is infiltrating IT infrastructure already, driven at a grass-roots level by progressive developers and IT operators. Given that, it’s important for IT teams to respond proactively and holistically to the change that is afoot.

Here are a few recommendations on how to approach this:

  • When adopting new technologies like containers, cluster schedulers and microservices, consider the process and people elements as much as the software.
  • In addition to looking toward Internet companies for technology inspiration, also consider people and process innovations, such as the emerging site reliability engineering discipline.
  • Evaluate packaged software solutions like Rocana Ops that provide off-the-shelf tooling to bring the practices of SREs to existing enterprise IT operations teams.

How long will it be until we have a Chief Site Reliability Engineer of the United States? Given the challenges of rolling out healthcare.gov in recent years, this may be a case of “the sooner, the better.” Regardless, while we await that milestone, it’s not too soon to consider the implications of the SRE discipline in your own organization.
