Amid shift to remote work, application performance monitoring is IT’s big moment

In recent weeks, millions have started working from home, putting unheard-of pressure on services like video conferencing, online learning, food delivery and e-commerce platforms. While some verticals have seen a marked reduction in traffic, others are being asked to scale to new heights.

Services that were previously nice to have are now necessities, but how do organizations track pressure points that can add up to a critical failure? A whole class of software exists to help with exactly that.

Monitoring tools like Datadog, New Relic and Elastic are designed to help companies understand what’s happening inside their key systems and warn them when things may be going sideways. That’s absolutely essential as these services are being asked to handle unprecedented levels of activity.

At a time when performance is critical, application performance monitoring (APM) tools are helping companies stay up and running. They also help track down root causes when the worst happens and a service goes down, with the goal of getting it running again as quickly as possible.

We spoke to a few monitoring vendor CEOs to better understand how they are helping customers navigate this demand and keep systems up and running when we need them most.

IT’s big moment

IT staff keep systems running, but as with many who work behind the scenes, few people think much about what they do until something goes wrong. As businesses navigate the COVID-19 crisis, they are learning just how valuable these employees are, says Lew Cirne, CEO and founder at New Relic.

“IT people are unsung heroes. They’re working behind the scenes. Everything is suddenly cranked up on the demands of their system and it’s where they’re really shining. They are basically doing their job when nobody notices because everything’s just working,” Cirne told TechCrunch.

He says technology is working to increase infrastructure capacity while the medical field is working to flatten the infection curve. “This is a moment of truth for technology. And while the world is busy flattening the curve as we should be, […] the opposite is happening in technology, the curve is spiking like you wouldn’t believe,” he said.

Cirne cites one customer, an online learning platform whose usage spiked 380% in a single week, as an example. His customers also include Zoom and BlueJeans, two video conferencing companies coping with huge spikes in demand in recent weeks.

Predicting problems before they happen

The engineers tasked with keeping these systems running need a lot of information, says Datadog founder and CEO Olivier Pomel. The new usage demands are only increasing that need.

“For every outage you see from a service you might use, there are probably dozens or hundreds of smaller incidents behind the scenes, and many of these incidents require a response. Even if they may not directly impact customers at that time, they would if you leave them unmitigated,” Pomel explained.

He says Datadog’s job is to help customers understand all of those incidents as quickly as possible and predict problems before they happen, so they can get ahead of issues before users are affected.

Shay Banon, founder and CEO at Elastic, says the first step to finding a problem is knowing what’s normal and what’s not, but it’s hard to know that when you are suddenly asked to scale to new levels very quickly.

When usage suddenly increases tenfold, Banon says the principal assumptions you used to make no longer apply. That forces you to rethink how your systems operate in the new reality, which requires observability into your infrastructure. “APM is a huge aspect of observability as a whole, but there is a huge concept related to APM. It’s about logging. It’s about metrics. It’s about infrastructure monitoring. It’s about [all of that],” he says.
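To make Banon’s point about knowing what’s normal concrete, here is a minimal sketch of one common way to flag unusual readings: compare a new value against the mean and standard deviation of recent history. It is an illustration only, not how Elastic or any vendor named here does it. The function name, the threshold of three standard deviations and the traffic numbers are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Return True when value sits more than `threshold` sample standard
    deviations away from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != baseline
    return abs(value - baseline) / spread > threshold


# Hypothetical requests-per-minute samples from an ordinary day.
normal_traffic = [100, 104, 98, 101, 97, 103, 99, 102]

print(is_anomalous(normal_traffic, 105))   # a small bump over the baseline
print(is_anomalous(normal_traffic, 1000))  # a roughly tenfold jump
```

A check like this also shows why a sudden jump in scale is so disruptive: once tenfold traffic becomes the everyday level, a baseline built on the old numbers flags everything, and the history has to be rebuilt before the alerts mean anything again.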

Working remotely

According to Pomel, IT teams tasked with monitoring critical systems have also been forced to work from home, which creates new challenges for them. He is seeing more people use Datadog’s product collaboratively in a way they might not have had to when they could convene in person for a quick meeting to discuss an issue.

“In addition to scaling up, all of these companies are switching to working remotely, and as such, there’s also a real increased need for their engineering teams and product teams to collaborate within the platforms they are using because you can’t actually get together and huddle behind a couple of screens and try to piece together data from disparate sources,” Pomel says. “So it’s even more important to understand the issue, because you don’t have people in [the same place].”

Elastic’s Banon says his company has always been remote, so he can advise CEOs who are trying to do this for the first time and let them know which solutions worked for his company.

“Some companies are trying to figure out how to operate in this new world. We are a truly distributed company. The fact we are working from home isn’t new to us. I have been talking to other CEOs and other leaders in various companies about how we make decisions and how we prioritize,” he said.

Each vendor has its own way of helping customers figure it out, but it’s about giving them the tools to understand what’s happening inside their systems. Even if APM can’t always keep services up, it can minimize disruptions when sites go down due to overuse.

As Cirne says, understanding a problem requires an investigative process. “Monitoring tells you when something is broken. Observability lets you ask why there is a problem, giving you the ability to ask ad hoc questions you might not have predicted,” he said.

So the next time your service just works — and especially when it doesn’t — think about the army of IT pros out there working to get it going and the monitoring tools that help them understand what’s going on. Those folks have your back, now more than ever.