6 Experts on Speeding Up Data

Speed. That’s what it’s all about these days. The problem: it’s often still more effective to ship data via FedEx than to squeeze the load across a network. It’s an absurd reality that moving data from one place to another can still require a plane.

Not everyone needs to move terabytes of data all day and all night. Shipping hard drives across the continent for a feature film is different from pulling in data to analyze and then presenting it in an application. But the loads will only get heavier with the connectivity of smartphones, the invisible geofence around your house, 3-D printers and the endless variety of data objects available to aggregate and analyze.

In applications, the complexity of moving data demands new ways to use flash and RAM. Hard drives are outdated; their mechanical parts cannot keep up with the volume and velocity of data that companies are analyzing. New databases are emerging. Startups and large companies like SAP are developing in-memory databases. NoSQL databases have become the darlings of the developer community.

The need for speed in application performance and analysis has endless dimensions. Matt Turck, who recently joined FirstMark Capital as a managing partner, commented in an interview last week at the firm’s New York offices that the Internet of Things (IoT) creates friction with data transfer. He cited the rise of MQTT, an IoT protocol for passing data that the New York Times says is “not really a lingua franca for machine-to-machine communication, but a messenger and carrier for data exchange.”

The MQTT inventor discovered the need for the messaging protocol when he started automating his 16th-century thatched-roof cottage on the Isle of Wight.
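MQTT’s appeal is its light weight: devices publish small messages to topics on a broker, and anything interested subscribes. The sketch below shows the pattern, assuming the open-source Eclipse Paho Python client (1.x callback API), a broker running on localhost and a made-up topic name; it is an illustration, not a reference implementation.

```python
# Minimal MQTT publish/subscribe sketch. Assumes paho-mqtt 1.x and a broker
# on localhost:1883; the topic name is hypothetical.
import paho.mqtt.client as mqtt

TOPIC = "home/sensors/temperature"

def on_connect(client, userdata, flags, rc):
    # Subscribe once the connection to the broker is established.
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    # Each message is a small payload pushed by a device.
    print(msg.topic, msg.payload.decode())

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883, keepalive=60)

# Publish a reading the way a sensor would, then process incoming messages.
client.publish(TOPIC, "21.5")
client.loop_forever()
```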

That ball the child rolls across the floor? As I discussed with Turck, it’s not just a ball but a data object with its own social identity, one that could someday connect to trillions of other objects. It will become an avatar, known more as a data object than as the Spaldeen the child bounces on the stoop of his family’s Brooklyn brownstone. Now think of all the data that will pass from objects such as this ball, and you can sense the scope of a world of zettabyte dimensions.

To get some perspective, I asked some experts about the new reality of data that seems to be encompassing just about everything these days. Their views say less about the future than about what is actually happening today.

MemSQL

Late in April, MemSQL made its real-time analytics platform generally available. The platform uses an in-memory database and a distributed infrastructure to analyze large amounts of data. The database is built for speed.

Eric Frenkiel, CEO and Co-Founder of MemSQL, said in an email interview that the last decade has seen great improvements in data retention at scale. “Companies have used compressed columnar stores and Hadoop to store large volumes of data, conferring a competitive advantage against companies that don’t retain and analyze their data,” Frenkiel said. “But big data is only useful if it’s accessible, and with larger data volumes, it’s become evident that companies are struggling to process this data to keep pace with the speed of their business.”

He said that as companies become increasingly data-driven, faster databases are necessary to counteract the inertia of these large data sets. “With data retention and storage solved, the next fundamental shift is speed,” Frenkiel said. “Swapping out hard disks for flash storage has helped somewhat, but the real innovation lies in evolving the software responsible for storing and analyzing that data.”

At its core, MemSQL is an aggregator that acts as the intelligence for the data analytics. Analysis is orchestrated across the nodes, which are there just to carry out the commands of the aggregator. These nodes are “wonderfully unaware.” They are the foot soldiers, Frenkiel said.
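MemSQL is wire-compatible with MySQL, so an application can send SQL to an aggregator with an ordinary MySQL driver and let the leaves do the scanning. A minimal sketch of that interaction, assuming a hypothetical `events` table and local connection settings:

```python
# Minimal sketch: querying a MemSQL aggregator over the MySQL wire protocol.
# Connection details and the `events` table are hypothetical.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="analytics")

with conn.cursor() as cursor:
    # The aggregator distributes this aggregation across its leaf nodes
    # and merges the partial results before returning them.
    cursor.execute("""
        SELECT event_type, COUNT(*) AS hits
        FROM events
        WHERE ts > NOW() - INTERVAL 1 HOUR
        GROUP BY event_type
        ORDER BY hits DESC
    """)
    for event_type, hits in cursor.fetchall():
        print(event_type, hits)

conn.close()
```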

“Companies are looking at in-memory computing because it offers a transformative approach to solving big data problems,” Frenkiel said. “In-memory computing solves the velocity component, but it needs to be paired with a scale-out architecture to satisfy the volume component.”

He added that in terms of adoption, companies will continue to leverage existing solutions but are augmenting their data warehouses with databases that can quickly consume and analyze data to make fast decisions. “The companies that will win are the ones that take advantage of the velocity of data to spot trends and identify anomalies as they are in the process of occurring,” Frenkiel said. “These companies still have to analyze huge data volumes, but they now have to architect for that velocity to gain real-time insights and spot competitive advantages.”

Enigma

Enigma won the Battlefield competition at Disrupt NY this past week. Co-Founder Hicham Oudghiri said the need for speed can also be thought of as the need for ever-expanding data. But he said the problems of scale for Enigma and others like Palantir are entirely different from what most startups face every day.

“Usually, when you think of scale, you think of users, millions of users, hundreds of millions of users, and they’re all more or less trying to look at the same thing (or the same “type” of thing from a schema perspective),” Oudghiri said. “So you have these people hitting your servers concurrently, and you need to think about things that are further up your lines of defenses first. You think about round-robin strategies for web servers, content delivery networks like CloudFlare and Amazon’s auto scale to match the peaks in traffic. But for data intensive apps, the problem is kind of reversed. You don’t actually need that many users to have scalability problems, because you have billions and billions of rows to inspect for any query. On top of that, often you have to bring together very disparate data schemas across datasets that are very, very diverse in nature and content.”

He said there is no magic answer to the problem. But redundancy is most important when considering how data is stored.

“There is no “right” model. SQL, noSQL, graph databases, etc.,” Oudghiri said. “Use them all. They all have their purpose and when used together in concert can really help you get most of the way there. Think of your database architecture as a whole less as one ideal system and more as a collection of complementary voices that can live harmoniously together. That way you can also be flexible and tinker with different parts at a time.

“Second, RAM is your friend and persistency is something you have to work for in your cluster of servers. We scale our search in RAM. It’s cheap enough to be able to do so at this point, so the real challenge is building software around maintaining persistency for things you store in RAM.”

I followed up with him after hearing from others, who said the cost of RAM is an issue.

“Some of our stuff is in SSD, but it’s just not fast enough for very horizontal search applications,” Oudghiri said. “Also, the degradation of the SSD drives is a little unpredictable. Though RAM is not persistent at all, at least I can count on its non-persistence. It’s really a risk/reward calculation.”
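Oudghiri didn’t name specific tools, but one common way to get this “search in RAM, engineer the persistence” trade-off is an in-memory store such as Redis with its append-only file turned on, so the data set can be rebuilt after a restart. A minimal sketch under that assumption (not necessarily what Enigma runs):

```python
# Sketch of the "keep it in RAM, work for the persistence" pattern using
# Redis and the redis-py client. Assumes a local Redis server; the index
# keys and record ids are made up.
import redis

r = redis.Redis(host="localhost", port=6379)

# Ask the server to journal every write to an append-only file so the
# in-memory data set survives a restart.
r.config_set("appendonly", "yes")

# A tiny inverted index kept entirely in RAM: term -> set of record ids.
r.sadd("idx:term:energy", "rec:1001", "rec:1007")
r.sadd("idx:term:imports", "rec:1007", "rec:2042")

# Queries become set intersections, served at memory speed.
matches = r.sinter("idx:term:energy", "idx:term:imports")
print(matches)  # e.g. {b'rec:1007'}
```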

10gen

The team at 10gen takes a different perspective. 10gen is the corporate sponsor for MongoDB, the NoSQL database. Jared Rosoff, Technical Director at 10gen, said in an email interview that there are at least two elements to speed: application development and application performance.

“There’s little argument that MongoDB speeds application development,” Rosoff said. “The flexible data model and idiomatic drivers make developers more productive and able to iterate features quickly.”

Regarding application performance, he said, MongoDB is designed to use all system memory as a cache of recently used data, allowing developers to achieve in-memory performance when the working set fits in memory.

DRAM becomes less viable as the workloads get bigger, Rosoff said.

“But when tackling big-data workloads on a purely in-memory database, you’ll be forced to buy enough DRAM to fit your entire data set,” Rosoff said. “This is challenging because of the upper limit of how much RAM you can put in a single server, and the cost of the RAM itself. MongoDB can use disk and flash-based storage to handle much larger data sets on a single server. Solid State Disks (SSDs) and Flash storage devices are allowing many customers to run MongoDB at nearly in-memory performance at a fraction of the cost of purely in-memory systems.”

Finally, MongoDB’s document data model ensures data locality, reducing the number of disk I/Os required for complex data models, which is critical for high performance.
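That locality argument is easiest to see in a document itself: related fields are embedded in one record so a single read returns everything the application needs. Below is a minimal PyMongo sketch with a hypothetical orders collection; it illustrates the pattern, not any particular customer’s schema.

```python
# Sketch of document-model data locality with PyMongo: the order, its line
# items and the customer details live in one document, so one read (and
# typically one disk I/O when the document isn't already cached in RAM)
# returns it all. The collection and fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

orders.insert_one({
    "_id": 1001,
    "customer": {"name": "Ada", "city": "Brooklyn"},
    "items": [
        {"sku": "ball-spaldeen", "qty": 2, "price": 1.50},
        {"sku": "printer-3d", "qty": 1, "price": 799.00},
    ],
    "status": "shipped",
})

# One round trip retrieves the order and everything embedded in it.
order = orders.find_one({"_id": 1001})
print(order["customer"]["name"], len(order["items"]))
```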

SlashDB

SlashDB makes APIs out of relational databases. Founder Victor Olex questioned whether speed is really what people need; in his view, the database itself has become the problem.

“The speed of access to data depends not only on the distance (measured network latency) but also the amount of time it takes to retrieve data from rest (i.e. file system, database) and the amount of data transformations along the way (i.e. format conversions, encoding, decoding, compression etc.),” Olex said. “Getting data from rest to fly also depends very much on the data structures implemented and how directly the data can be referenced. In the context of enterprise data we have to focus on databases, which have become a bottleneck in today’s web-scale information systems. Relational databases allow for convenient declarative queries, but that comes at the cost of a good amount of in-memory computation and disk access to determine which records need to be sent back. Conversely, document store databases generally require retrieving data by numerical keys assigned to them when they were stored or by predefined indexes (lookup tables), which map search terms to those numerical keys.

“Various caching technologies help with data access speed but do so at the expense of accuracy. For example, a web page may be slightly out of date for the duration of the cache retention setting, but that may be an acceptable trade-off to save processing resources on the publisher’s database. The caching and web application layers can scale out to run simultaneously on multiple servers, allowing for parallel handling of incoming request traffic, while traditional database servers, though they can work with multiple connections at once, generally can only scale up (bigger box, more memory and processor). SlashDB was designed with web-scale architecture in service of enterprise systems. It is a scalable web service, which interprets URLs into database queries and delivers the response over HTTP. It can run on multiple nodes and be coupled with a typical HTTP proxy to facilitate caching of repetitive requests.

“Data intensive processing has traditionally been the domain of server-side systems. With the proliferation of mobile devices this changes somewhat, but ultimately there is a limit at which a human person can process information, which reportedly is about 60 bits per second, so one could argue that delivering anything beyond that is a waste (except we all want to know about those tweets and new emails happening in the background).”
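SlashDB’s actual URL grammar and implementation are not reproduced here, but the core idea Olex describes, turning a URL path into a parameterized database query and returning the result over HTTP, can be sketched in a few lines. Everything in the example (the path convention, the table, the whitelist) is illustrative only:

```python
# Illustrative only: a toy URL-to-SQL translator in the spirit of what
# SlashDB does (this is not SlashDB's grammar or implementation).
# Uses an in-memory SQLite database so the sketch is self-contained.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
db.execute("INSERT INTO customer VALUES (1, 'Acme', 'New York')")

# Whitelist of tables and columns; never build SQL from raw URL input alone.
ALLOWED = {"customer": {"id", "name", "city"}}

def handle(path):
    """Map a path like /customer/name/Acme.json to a parameterized SELECT."""
    table, column, rest = path.strip("/").split("/")
    value = rest.rsplit(".", 1)[0]
    if table not in ALLOWED or column not in ALLOWED[table]:
        return "404 Not Found", "[]"
    rows = db.execute(
        "SELECT * FROM {} WHERE {} = ?".format(table, column), (value,)
    ).fetchall()
    return "200 OK", json.dumps(rows)

status, body = handle("/customer/name/Acme.json")
print(status, body)  # 200 OK [[1, "Acme", "New York"]]
```

Fronting such a service with an ordinary HTTP proxy, as Olex suggests, is what lets repeated requests for the same URL be served from cache instead of hitting the database again.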

AlchemyAPI

“AlchemyAPI is obsessed with speed,” said CEO and Founder Elliot Turner. “Our customers recognize the time value of information; therefore, we have a financial incentive to process data as quickly as possible. There is no easy fix to making data processing applications run more quickly; bottlenecks can occur within a variety of areas, including the ability to store and retrieve data or analyze it.

“When it comes to performing efficient analysis of data, approaches can include hardware acceleration via GPUs, distributed computing, and algorithmic innovations. AlchemyAPI leverages all of these in combination with a heavy usage of solid state technology (both RAM and SSD) to store, retrieve, and process data at high rates of speed.

“With regard to solid state technologies, RAM costs are dropping steadily (http://www.jcmit.com/mem2013.htm) but are still an order of magnitude more expensive than SSDs and 100x more expensive than traditional hard drives. RAM is still somewhat cost prohibitive for truly “big data” applications that operate at the petabyte scale, therefore many vendors (including AlchemyAPI) are leveraging hybrid approaches that incorporate both RAM and SSD technology. However, we’re seeing a steady embrace of RAM for smaller deployments in the terabyte range and expect this trend to continue as prices decrease over time.”

http://www.slideshare.net/alchemyapi/efficient-big-data-analysis-gluecon-2012
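Turner doesn’t detail how the RAM-plus-SSD tier is organized, but the general pattern, keeping the hottest items in an in-memory layer and falling back to a store on solid-state disk, looks roughly like the sketch below. The class, sizes and file path are illustrative and not AlchemyAPI’s design.

```python
# Toy RAM-first, SSD-fallback cache: hot items live in an in-memory LRU,
# and everything is persisted to a disk-backed store assumed to sit on an
# SSD. Purely illustrative.
import shelve
from collections import OrderedDict

class HybridCache:
    def __init__(self, path, ram_items=1024):
        self.disk = shelve.open(path)      # disk-backed key/value store
        self.ram = OrderedDict()           # in-memory LRU layer
        self.ram_items = ram_items

    def put(self, key, value):
        self.disk[key] = value             # always persisted to disk
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_items:
            self.ram.popitem(last=False)   # evict the coldest entry from RAM

    def get(self, key):
        if key in self.ram:                # RAM hit: the fastest path
            self.ram.move_to_end(key)
            return self.ram[key]
        value = self.disk[key]             # miss: read from the SSD-backed store
        self.put(key, value)               # promote it back into RAM
        return value

cache = HybridCache("/tmp/hybrid.db", ram_items=2)
cache.put("doc:1", "sentiment: positive")
print(cache.get("doc:1"))
```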

Summary

There is more to speeding up data than using SSDs, RAM or new database technology. But one thing is certain. The world is becoming a highly distributed data mesh that will require new ways to speed up data as the loads get larger and larger.