Are site reliability engineers the next data scientists?

It’s no secret that “data scientist” is one of the hottest job titles going. DJ Patil famously proclaimed data scientist “The Sexiest Job of the 21st Century” before moving on to join the White House as the first chief data scientist of the U.S. Once a rarefied in-house role at a few leading Internet companies such as LinkedIn and PayPal, data science has since grown into a global phenomenon, impacting organizations of all sizes across many industries.

More recently, a buzzy new job title has emerged from the same group of companies: that of site reliability engineer, or SRE. Will SREs follow the same path of rapid growth that data scientists did before them? Before we dive into that question, let’s consider the context that has led to the creation of site reliability engineering.

The new IT stack

Over the last 15 years, the largest Internet properties have quietly led a revolution in IT. The reason is simple: Traditional corporate data center techniques could not scale efficiently to the level required to run a global service like Google or Facebook. Instead, these companies have had to innovate at all layers of the technology stack, from hardware to networking to applications.

In many cases, the resulting building blocks have been released as open source software packages, or have inspired third parties to create their own versions. Now, organizations ranging from startups to the largest Fortune 500 enterprises are adopting these technologies for their own purposes.

Examples of this phenomenon are numerous. To pick just a few:

  • Containers. Google’s widespread internal adoption of lightweight OS containers inspired the rapidly growing movement around Docker, driving the company at the center of this phenomenon to $162 million in funding and prompting the creation of industry-wide collaborations like the Open Container Project.
  • Cluster management. Google’s internal Borg project similarly inspired two fast-growing open source communities around the Kubernetes and Mesos cluster resource management frameworks, setting the stage for efforts like the Cloud Native Computing Foundation.
  • Analytics. Google’s data processing innovations inspired Yahoo’s early investments into Hadoop, which has in turn spawned a whole ecosystem of modern big data technologies and commercial players, including Cloudera and Hortonworks.
  • Microservices. Amazon and Netflix were early innovators and evangelists in the practice of designing software applications as suites of independently deployable services, an approach that is also being widely adopted in industry in the form of products like the Reactive Platform from Lightbend (formerly Typesafe).

A unifying theme of these technologies is higher efficiency and lower cost at larger scale. But source code won’t solve these challenges in isolation. It must be complemented by new management techniques, methodologies and tools. In other words, the big picture needs to consider people and process as much as it does software.

The rise of site reliability engineering (SRE)

For inspiration on the people and process front, we can similarly look to the web-scale Internet companies. Many of the early innovators have rallied around the concept of site reliability engineering.

Ben Treynor, who joined Google as a site reliability tsar in 2003, has described SRE as “what happens when a software engineer is tasked with what used to be called operations.” Over the last decade, the team that Treynor started at Google has grown from a handful of production engineers to more than 1,000 SREs.

Moreover, the SRE concept has been embraced by other major Internet companies, including Dropbox, Airbnb, Netflix and many more. Job listings site Indeed now lists hundreds of SRE positions. The SRE community now even has its own conference, dubbed SREcon.

Andrew Widdowson, an SRE at Google, relates the discipline to competitive auto sports: “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”

As any competitive racing fan knows, a faster engine and chassis don’t mean much without a world-class pit crew, equipped with the right tools, techniques and strategies to keep the car in the lead. In Formula 1 racing, the days of winning races based on gut instinct are waning. Today’s winning teams are differentiated by real-time streaming data analytics as much as they are by pistons and tires.

SRE-in-a-box

It’s all well and good to be inspired by the large Internet companies, but how do we integrate the SRE discipline into existing enterprise IT teams?

Just as companies such as Cloudera packaged the early “tribal knowledge” around data engineering and turned it into turnkey products accessible to a mass IT audience, a new batch of companies is packaging the principles of SRE for the masses. The recently introduced Rocana Ops is one example. [Disclosure: I am an investor in Rocana.]

Rocana Ops gives administrators visibility into the inner workings of their data centers and applications. Just as a Bloomberg terminal enables brokers to monitor and investigate activity across markets, Rocana Ops uses big data techniques, combined with data visualization, to guide IT operators to the root cause of any issue in their complex IT infrastructure. Companies using Rocana Ops to power their IT operations gain the capabilities of the site reliability engineering discipline, without the steep learning curve.

A motivating example

Consider the example of a contemporary multi-channel e-commerce application. A typical modern system might consist of core business logic implemented in Scala, linked to a legacy off-the-shelf Java order management system, backed by multiple transactional databases (say, both MongoDB and Oracle) and fronted by a Node.js API tier.

Some pieces of this puzzle may be deployed in an on-premises data center, while other components live on a public cloud provider like Amazon Web Services.

There will be dependencies on third-party services (perhaps Stripe for payments), and a mix of web endpoints and native mobile apps for Android and iOS interacting with the core system through an API gateway.

Now, consider a typical business-critical problem that could crop up: request timeouts are driving shopping cart abandonment by mobile app users. How long would it take to notice the problem to begin with? Once the problem is identified, given such a complex web of interacting technologies, where would one even start to look for the underlying root cause?

Is it a network issue, a database performance problem or an application error introduced in the most recent release?

With an SRE-inspired approach, system logs and telemetry are continuously collected, in real time, from all components of the system, and stored in a central data store. Machine-learning algorithms identify anomalous events (such as the rash of timeouts from mobile devices that represent a statistical outlier compared to historical patterns) and surface them to the attention of IT staff.
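
To make the idea of a statistical outlier concrete, here is a minimal sketch in Python. It is not how Rocana Ops or any particular product works. It assumes a hypothetical series of per-minute timeout counts and flags any minute whose count sits far above the mean of the preceding few minutes, measured in standard deviations. The function name, the window size, the threshold and the sample data are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_spikes(counts, window=5, threshold=3.0):
    """Return the indices of counts that spike far above recent history.

    Each count is compared with the `window` counts before it. A count is
    flagged when it exceeds the mean of that history by more than
    `threshold` standard deviations (a simple z-score test).
    """
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Assumption: floor the deviation at 1.0 so that a perfectly flat
        # history does not cause a division by zero or flag tiny changes.
        sigma = max(stdev(history), 1.0)
        z_score = (counts[i] - mu) / sigma
        if z_score > threshold:
            spikes.append(i)
    return spikes


# Hypothetical timeouts per minute reported by mobile clients.
timeouts_per_minute = [2, 3, 2, 4, 3, 2, 3, 41, 3, 2]
print(find_spikes(timeouts_per_minute))
```

A production system would likely account for seasonality (time of day, day of week) and would keep an outlier from contaminating the history used to judge the minutes that follow it. This sketch does neither.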

A rich web interface incorporating data visualizations guides the admin to the most relevant log events, highlighting other contemporaneous behavior changes observed across all elements of the IT infrastructure, wherever they reside.
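
The "contemporaneous behavior" step can be sketched just as simply. The example below assumes log events have already been centralized as (timestamp, component, message) tuples, and it gathers everything logged within a couple of minutes of an anomaly, grouped by component. The component names, messages and timestamps are hypothetical, and a real tool would rank and filter these events rather than return them all.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def events_near(events, anomaly_time, window=timedelta(minutes=2)):
    """Group messages logged within `window` of `anomaly_time` by component."""
    nearby = defaultdict(list)
    for timestamp, component, message in events:
        if abs(timestamp - anomaly_time) <= window:
            nearby[component].append(message)
    return dict(nearby)


# Hypothetical centralized log events from several parts of the system.
events = [
    (datetime(2016, 1, 1, 11, 30, 0), "order-management", "nightly batch finished"),
    (datetime(2016, 1, 1, 11, 59, 10), "mongodb", "slow query: 4200 ms"),
    (datetime(2016, 1, 1, 11, 59, 40), "api-gateway", "upstream timeout"),
    (datetime(2016, 1, 1, 12, 0, 30), "api-gateway", "upstream timeout"),
    (datetime(2016, 1, 1, 12, 10, 0), "oracle", "checkpoint complete"),
]

# Assume the spike in mobile timeouts was detected at noon.
anomaly_time = datetime(2016, 1, 1, 12, 0, 0)
for component, messages in events_near(events, anomaly_time).items():
    print(component, messages)
```

In this made-up data, the slow database query shows up alongside the gateway timeouts, which is the kind of lead an operator would want surfaced first.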

Armed with the ability to quickly narrow in on the relevant data, the admin can identify the underlying problem.

Adapting to the new normal

The new stack is infiltrating IT infrastructure already, driven at a grass-roots level by progressive developers and IT operators. Given that, it’s important for IT teams to respond proactively and holistically to the change that is afoot.

Here are a few recommendations on how to approach this:

  • When adopting new technologies like containers, cluster schedulers and microservices, consider the process and people elements as much as the software.
  • In addition to looking toward Internet companies for technology inspiration, also consider people and process innovations, such as the emerging site reliability engineering discipline.
  • Evaluate packaged software solutions like Rocana Ops that provide off-the-shelf tooling to bring the practices of SREs to existing enterprise IT operations teams.

How long will it be until we have a Chief Site Reliability Engineer of the United States? Given the challenges of rolling out healthcare.gov in recent years, this may be a case of “the sooner, the better.” Regardless, while we await that milestone, it’s not too soon to consider the implications of the SRE discipline in your own organization.