A while back, Yahoo quietly released the code for Omid, an open source transaction processing system for Apache HBase, the big data store built on Hadoop, on GitHub. This is the same software the company uses internally to help it power thousands of search transactions per second.
Until now, Yahoo has remained rather quiet about this project, but with the latest update, launching today, it feels the software is now robust enough for wider deployment and has proven its ability to scale. It’s also 10 times faster than the first version the company released to the public.
Yahoo’s director of engineering Ralph Rabbat and senior director of product management Sumeet Singh told me earlier this week that the company hopes that other platforms in the Hadoop and HBase ecosystem will adopt Omid.
Indeed, Yahoo hopes Omid will follow a trajectory similar to Hadoop. Hadoop, after all, began at Yahoo, and the company is one of its largest users and has remained very active in the open source efforts around it. Rabbat and Singh hope that Omid will eventually become an official Apache project, just like HBase. In an effort to reach out to the open-source community, the company plans to publish a series of blog posts about deploying and using Omid over the next few weeks.
By default, HBase does not conform to the ACID (Atomicity, Consistency, Isolation, Durability) principles of database design. Omid aims to ensure that applications can perform read and write operations on HBase with ACID properties by extending the HBase key-value API with transaction semantics.
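To make "transaction semantics on a key-value API" more concrete, here is a minimal in-memory sketch in Python. It is not Omid's API or its actual design: the names `ToyStore`, `Transaction` and `ConflictError` are hypothetical, and the approach shown (each transaction reads from a snapshot, buffers its writes, and is rejected at commit if another transaction has since changed the same key) is just one common way to get atomic, isolated updates on top of plain get/put operations.

```python
class ConflictError(Exception):
    """Raised when a transaction loses a write-write conflict at commit."""


class ToyStore:
    """Hypothetical multi-version key-value store (illustration only)."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # timestamp of the most recent commit

    def begin(self):
        # A transaction sees everything committed up to this moment.
        return Transaction(self, self._clock)

    def _read(self, key, snapshot_ts):
        # Return the newest value committed at or before the snapshot.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def _commit(self, start_ts, writes):
        # Check every key first, so a failed commit applies nothing.
        for key in writes:
            versions = self._versions.get(key)
            if versions and versions[-1][0] > start_ts:
                raise ConflictError(key)
        self._clock += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self._clock, value))


class Transaction:
    def __init__(self, store, start_ts):
        self._store = store
        self._start_ts = start_ts
        self._writes = {}  # buffered until commit

    def get(self, key):
        if key in self._writes:  # read your own uncommitted writes
            return self._writes[key]
        return self._store._read(key, self._start_ts)

    def put(self, key, value):
        self._writes[key] = value

    def commit(self):
        self._store._commit(self._start_ts, self._writes)


store = ToyStore()
setup = store.begin()
setup.put("alice", 100)
setup.put("bob", 0)
setup.commit()

t1 = store.begin()
t2 = store.begin()  # starts before t1 commits

# t1 moves 30 from alice to bob; both writes become visible together.
t1.put("alice", t1.get("alice") - 30)
t1.put("bob", t1.get("bob") + 30)
t1.commit()

seen_by_t2 = t2.get("alice")  # still 100: t2 reads from its own snapshot
t2.put("alice", 0)
try:
    t2.commit()
    t2_committed = True
except ConflictError:
    t2_committed = False      # t1 changed "alice" after t2 began

final = store.begin()
print(seen_by_t2, t2_committed, final.get("alice"), final.get("bob"))
```

In this toy, the transfer either lands in full or not at all, and the second transaction cannot silently overwrite a value it never saw. A real system on HBase would also need durability and a way to coordinate timestamps across many servers, which this sketch leaves out entirely.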
As Rabbat told me, the company looked at the gap between traditional relational databases (which don’t scale all that well) and NoSQL databases (which typically don’t have transaction support). What was missing for Yahoo was transaction support that would allow HBase to process small updates individually. Google solved this with Percolator, but that’s still a proprietary system. Omid, then, is in a way an open-source implementation of Google Percolator.
Internally, Yahoo uses Omid on top of its Sieve content management system, which drives its Search platform, among other things. That’s essentially a multi-petabyte HBase store holding billions of documents. There, Omid helps power tens of thousands of transactions per second.
Rabbat and Singh believe Omid could be really useful in other applications, too. Apache Phoenix, which essentially implements SQL on top of HBase, could use it as its transaction management component, for example. More broadly, any HBase system that needs to support incremental real-time processing could use Omid. As Singh also noted, those don’t have to be web-scale implementations, either. Omid can work just as well at a smaller scale.
For Yahoo, the main benefit of open sourcing a project like Omid is that many of the community’s improvements will directly help it improve its own service. That’s something that held true for Hadoop, and the company hopes to replicate this success with projects like Omid.
“The value we got out of Hadoop has only increased since we open sourced it,” Singh said. Rabbat added that open source is also an increasingly important recruiting tool for the company. He also noted that, as Yahoo has stepped up its acquisition efforts since Marissa Mayer took the reins, integrating the technologies of other companies has often been relatively easy because those companies already used Hadoop, too.
The Omid code is now available on GitHub.