Google’s Real Time Big Data Tool Cloned By Apache Drill

Google, as you might expect, has massive amounts of data, and it has built many tools to handle it. These include MapReduce and GoogleFS, which spawned the open source Apache Hadoop, and BigTable, which spawned Apache HBase.

But Google didn’t stop with those projects. It has continued to create new big data tools and to publish papers about them. Dremel (PDF) is designed to make querying the huge data sets stored in GoogleFS and BigTable much faster. Where a MapReduce job on Hadoop could take hours or even days, Dremel is designed to return results in seconds.

Apache Drill is an attempt to build an open source version of Google Dremel, and the project was recently accepted into the Apache Incubator program. It’s supported by MapR, a company that sells a modified version of Hadoop with proprietary customizations.

There are other open source real-time big data systems, notably Storm, which was developed at BackType and open sourced by Twitter, and Apache S4, which was open sourced by Yahoo. Storm in particular has gotten a lot of attention lately, and Nodeable recently launched a cloud-hosted version of it.

Nodeable CEO Dave Rosenberg says the big difference between Dremel and other real-time big data systems such as Storm and S4 is that the latter are streaming engines, while Dremel is designed for ad-hoc querying, i.e., really fast query results.

Hadoop is for batch processing, meaning that queries are run on a set of data that you already have. Streaming engines process data as it comes in. The terms “streaming” and “real time” are often used interchangeably, which could lead to some confusion about Dremel and Drill: they are also described as real time, but in the sense of answering queries quickly, not of processing data as it arrives.
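
To make the batch versus streaming distinction concrete, here is a minimal Python sketch. It is only an illustration and does not use Hadoop, Storm, S4, Dremel or Drill; the function names batch_count and streaming_count and the sample events are hypothetical. The batch function needs the whole data set before it runs, while the streaming function updates its answer as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: Iterable[str]) -> Counter:
    """Batch style: count over a data set that already exists in full."""
    return Counter(events)


def streaming_count(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: emit an updated count after every incoming event."""
    running = Counter()
    for event in events:
        running[event] += 1
        # Yield a snapshot so later updates do not change earlier results.
        yield running.copy()


# Hypothetical sample data standing in for a stored log or a live feed.
stored = ["click", "view", "click"]

batch_result = batch_count(stored)
stream_results = list(streaming_count(iter(stored)))
```

Both approaches end with the same totals. The difference is when an answer becomes available: once, after the whole job finishes, or continuously, as data comes in.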

“Ultimately you’ll probably want both ad-hoc and real-time, which presents a few challenges,” Rosenberg says. “In our eyes, the real-time computation is more technically difficult and arguably more valuable as there are a lot more immediate use cases. That said, users will always need to run queries against data. We just think the latency on a query is more acceptable than on the computation side–at least for now.”

Another project, OpenDremel, is also working to create an open source version of Dremel. Other projects working on speedy queries for big data include Apache CouchDB and the Cloudant-backed variant BigCouch.