If You Think Big Data's Big Now, Just Wait

We’ve been hearing a great deal about the power of Big Data over the last couple of years. The hype says that as we gather more data, we will be able to get better answers to business problems, but the data levels we have seen so far could pale in comparison to what happens when we start adding sensors to the world. When everything from jet engines to soft drink vending machines to car seats has sensors in it, we will see an explosion of data the likes of which we’ve never seen, and everything that came before will seem, well, small.

And the implications of such a world with every device sending in data to a waiting database are difficult to imagine at this point.

As Natasha Lomas pointed out in a recent TechCrunch article, Sensors And Sensitivity, one outcome could be that, instead of our using devices to deliberately measure our activity, sensors in the environment could pick up and measure that activity as we move through the world and interact with them, without a specific device being devoted to it.

“The world around us gains the ability to perceive us, rather than wearable sensors trying to figure out what’s going on in our environment by taking a continuous measure of us,” Lomas wrote.

And to some extent we are in the very early stages of seeing this happen. When it actually begins to take off, though, Peter Levine, a partner at venture capital firm Andreessen Horowitz, sees the addition of pervasive sensor data as an industry-altering event. “Think about what happens when the Internet of Things becomes more pervasive. Endpoint devices in the trillions will be sending some information back to a compute engine…We as businesses and humans want to do something with that in real time.”

And it is precisely that real-time element that today’s relational databases seem to struggle with. As Levine pointed out, if we are talking about a jet engine or a security monitoring system, we can’t afford to wait three or four hours to process that data. We need to be able to get answers in near real time.

I wrote about Adatao, one company trying to process huge tracts of data, which received $13M in funding last week in a round led by Levine’s firm. In fact, writing on the company blog about the Adatao funding, Levine had this to say about the ever-growing amount of data:

“The promise of big data has ushered in an era of data intelligence. From machine data to human thought streams, we are now collecting more data each day, so much that 90% of the data in the world today has been created in the last two years alone. In fact, every day, we create 2.5 quintillion bytes of data — by some estimates that’s one new Google every four days, and the rate is only increasing…,” Levine wrote.

With those kinds of numbers, we need better tools to process the data. One company that recognizes this problem is GE, which builds huge industrial-grade equipment like jet engines, railroad locomotives, pipelines and electric grids. It wants to instrument these devices and take better advantage of the data they generate to make operations run more efficiently.

To give you a sense of how early we are in the process, Bill Ruh, VP of the global software center at GE, says industry estimates suggest there will be 17B connected industrial assets in place by 2025. He estimates that just 10 percent of those devices are equipped with sensors today, and most of those lack the intelligence the company hopes they will have in the future, simply reporting when something has gone very wrong.

GE has been working with Pivotal to better understand this problem. (It’s worth noting that GE owns a 10 percent stake in the EMC/VMware spinoff.) The two companies are working together to build what they are calling a “data lake.” This is a more flexible approach to large data sets than a data warehouse, which they point out was designed a decade ago with ERP and CRM data in mind. The quantity of data today is so much greater that it requires a much more flexible architecture to accommodate it.

One of the first areas where GE is testing this technology is in its jet engine division, where it estimates each engine can generate 1TB of data from a single flight. Multiply that by many flights per day and you are facing monumental amounts of data from just one type of industrial device.

Ruh claims that using data lake software cuts the time before they can begin working with the data from days to minutes. How much did it improve the process? According to GE, a data warehousing approach took 30 days to ingest, structure, integrate and process the data, and the data lake approach brought that down to 20 minutes. Yes, you read it correctly, 20 minutes.
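To put those two figures side by side, here is a quick back-of-the-envelope check in Python. It uses only the 30-day and 20-minute numbers GE cited; the variable names are purely illustrative, and it says nothing about how GE actually measured either process.

```python
from datetime import timedelta

# Figures as cited by GE in this article.
warehouse_time = timedelta(days=30)
data_lake_time = timedelta(minutes=20)

# Dividing one timedelta by another yields a plain float ratio.
speedup = warehouse_time / data_lake_time

warehouse_minutes = warehouse_time.total_seconds() / 60
data_lake_minutes = data_lake_time.total_seconds() / 60

print(f"Warehouse approach: {warehouse_minutes:,.0f} minutes")
print(f"Data lake approach: {data_lake_minutes:,.0f} minutes")
print(f"Speedup: roughly {speedup:,.0f}x")
```

Thirty days is 43,200 minutes, so if the claim holds, the data lake approach is about 2,160 times faster.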

If that’s true, and they claim that it is, that’s a remarkable efficiency gain, and one that takes into account the massive amount of data they have to deal with from just their jet engine business.

But Ruh says they didn’t stop there. They didn’t want to just have the data available fast for the sake of proving it could be done. They wanted to do something with that data, and use it to better understand how the engine was working and possibly even predict part failures before they happened.

To that end, they combined it with technology they developed with the consulting firm Accenture to build a tool called Taleris, which they claim can accurately predict part failure. The trouble was that they needed a serious amount of data for the prediction platform to do its job. The data lake developed with Pivotal gives them that.

Zebra Technologies is another company thinking a lot about this. They are the folks who bought Motorola Handheld Solutions back in April for $3.45B. They see a big connection between what Zebra does in the barcode, receipt, kiosk and RFID printer business, the handheld scanning business they bought from Motorola and the future they see of sensors in the warehouse.

Phil Gerskovich, SVP of New Growth Platforms at Zebra Technologies, says his company is very much thinking about how sensors will play out in the warehouse, and he sees it coming down to one of three questions:

What is it?
Where is it?
What is its condition?

And he says these questions can apply whether you’re talking about employees, fruit or blue jeans. For all the technology we have today, he says it’s still hard for many large warehouse operations to answer these questions, and he believes sensors will change that. To that end, his company has already developed a cloud-based software framework called Zatar, which he describes as a platform for managing millions of devices and connecting them to enterprise applications. Those applications can then make use of the stream of information coming from the sensors in whichever way makes sense in the context of that particular piece of software.

As one more example of how this could work, when I was at Mobile World Congress last winter, SAP demonstrated a smart vending machine for me. The machine not only learned who I was and created a surprising level of social interaction between machine and human, it also broadcast information about itself back to the warehouse, where humans could see if it was running out of product or showing signs of needing maintenance soon. The system was designed in such a way that SAP claimed it could prioritize maintenance calls by location, so that a machine needing attention at a stadium where there was an event that evening got a service call before one in some random office building.
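SAP did not explain how that prioritization works, so what follows is only a minimal sketch of the general idea in Python, not SAP's implementation. Every name in it (dispatch_order, the event_tonight flag, the 1-to-5 urgency scale, the machine IDs) is hypothetical and invented for illustration.

```python
import heapq


def dispatch_order(requests):
    """Return machine IDs in the order a technician might visit them.

    Hypothetical rule: machines at a venue with an event tonight go first,
    then higher urgency, then the order in which requests arrived.
    """
    heap = []
    for index, req in enumerate(requests):
        # heapq pops the smallest key first, so False (event tonight)
        # sorts ahead of True, and urgency is negated so 5 beats 1.
        key = (not req["event_tonight"], -req["urgency"], index)
        heapq.heappush(heap, (key, req["machine_id"]))

    order = []
    while heap:
        _, machine_id = heapq.heappop(heap)
        order.append(machine_id)
    return order


# Made-up sample data, not taken from the SAP demo.
requests = [
    {"machine_id": "office-12", "urgency": 3, "event_tonight": False},
    {"machine_id": "stadium-07", "urgency": 3, "event_tonight": True},
    {"machine_id": "airport-03", "urgency": 5, "event_tonight": False},
]

result = dispatch_order(requests)
print(result)
```

With this sample data the stadium machine comes out first even though the airport machine reports a higher urgency, which mirrors the behavior SAP described. A real system would presumably weigh many more signals than these two.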

Companies like Adatao and projects like these going on inside large companies like GE, Zebra and SAP are proofs of concept right now, but over time these types of companies and projects could have a profound impact on how we process and deal with big data. Part of the paradox of Big Data is that the more data we have, the more overwhelming it can be, but it often takes a large amount of data to give you the best outcomes.

The good news is there are companies trying to solve these problems today, before the majority of the sensors are in place and the inevitable big data onslaught comes. The time to figure this all out is now, not after the sensors are out there transmitting data.