Apache Hadoop is becoming the de facto environment for spreading data across a distributed infrastructure and later analyzing it with MapReduce to optimize web pages, personalize content or increase the effectiveness of online advertising.
There’s just one problem.
Hadoop is not meant as a storage environment. Metadata is kept on one server, and data is replicated three times to make sure nothing gets lost. Keeping three copies gets expensive, in both raw capacity and management overhead, when the data store is petabytes in size. And if the metadata server fails, the replicated data can become entirely inaccessible.
Cleversafe believes it has the answer: combining its object-storage dispersal technology with the capabilities of Hadoop MapReduce.
Cleversafe uses a technique called erasure coding. It takes data and slices it into little pieces. The slices are distributed to separate disks, storage nodes and geographic locations. Once the data is dispersed, Cleversafe’s Information Dispersal Algorithms (IDAs) can reconstitute it from a subset of the slices originally stored.
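To make the slice-and-rebuild idea concrete, here is a minimal Python sketch. It is not Cleversafe's IDA, whose internals are not described here; it swaps in the simplest possible erasure code, a single XOR parity slice, which can recover from the loss of any one slice. The function names `disperse` and `reconstitute` are hypothetical and chosen only for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size slices plus one XOR parity slice.

    Assumption: a toy single-parity scheme, not Cleversafe's algorithm.
    """
    slice_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(slice_len * k, b"\0")  # pad so slices are equal length
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(k)]
    parity = reduce(xor_bytes, slices)
    return slices + [parity]


def reconstitute(slices: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data when at most one slice (marked None) is missing."""
    missing = [i for i, s in enumerate(slices) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost slice")
    if missing:
        # XOR of all surviving slices equals the one that was lost.
        present = [s for s in slices if s is not None]
        slices = list(slices)
        slices[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity slice and strip the padding.
    return b"".join(slices[:-1])[:original_len]


data = b"web logs for MapReduce analysis"
pieces = disperse(data, 4)   # 4 data slices + 1 parity slice
pieces[2] = None             # simulate a failed disk or node
restored = reconstitute(pieces, len(data))
```

With four data slices and one parity slice, raw storage is 1.25 times the data size, against 3 times for triple replication. The trade-off is that this toy scheme survives only one lost slice; production erasure codes such as Reed-Solomon tolerate several, at the cost of heavier math.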
Here’s the twist. Cleversafe proposes that after dispersal, the data goes back into Hadoop for analysis. Hadoop does best when the computation is brought to the data, and the Cleversafe technique is meant to accomplish that.
Cleversafe CEO Chris Gladwin tells us the benefit comes in three ways:
Does this represent the next generation of big data analytics? Hadoop has inherent weaknesses in terms of its storage capabilities. That is why it is a natural fit for storage vendors such as EMC and IBM.
The difference for Cleversafe boils down to its erasure coding capabilities for building large-scale clouds, such as the one it is now helping develop for Lockheed Martin to serve federal agencies.
But erasure coding does have its flaws. Wikibon’s Dave Vellante points out that it is math heavy and requires considerable system resources to manage:
As such you need to architect different methods of managing data with plenty of compute resource. The idea is to spread resources over multiple nodes, share virtually nothing across those nodes and bet on Intel to increase performance over time. But generally, such systems are most appropriate for lower performance applications, making archiving a perfect fit.
If it can truly do all it says it can do, then Cleversafe may prove a formidable player in the emerging big data land grab.
But I am not so sure. There’s a reason why Amazon Web Services and Google use a commodity infrastructure. It is affordable and can be continuously optimized. As efficiencies get better, the prices go down. On AWS’s Elastic MapReduce service, you can lease clusters to run Hadoop MapReduce jobs.
Cleversafe has a compelling case, but the methods used by AWS show the benefits that come with distributed infrastructures built on very cheap hardware.