Quantcast Open Sources Hadoop Distributed File System Alternative

Quantcast, an internet audience measurement and ad targeting service, processes over 20 petabytes of data per day using Apache Hadoop and its own custom file system, called Quantcast File System (QFS). Today, it’s making that technology available as open source under an Apache license. You can now find it on GitHub.

The default Hadoop file system is called Hadoop Distributed File System (HDFS). CEO Konrad Feldman says Quantcast started using Hadoop in 2006. In 2008, as Quantcast started collecting 1TB of data per day, the team realized that it was going to need a file system with better throughput than HDFS.

They settled on the open source Kosmos Distributed File System (kosmosfs), but didn’t feel that it was production ready. So they hired Sriram Rao, the lead architect of kosmosfs, to work on making it production ready. The result was QFS, which Quantcast has been using in production for about four years now, though Rao has since left the company for Microsoft.

Feldman says internal benchmarks found that QFS reads are up to 75% faster than HDFS and writes up to 46% faster. QFS uses a more efficient data replication system based on Reed-Solomon coding so that data takes up less disk space, which also reduces the overall IO required. And QFS is written in C++, which has a few performance benefits over Java as well.
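
To see why Reed-Solomon coding can take up less disk space than keeping full copies, here is a rough back-of-the-envelope sketch in Python. The specific numbers are assumptions for illustration only and are not stated in this article: three full copies for plain replication, and a layout of 6 data stripes plus 3 parity stripes for the erasure-coded case. The function names are hypothetical as well.

```python
def raw_storage_replicated(logical_tb, copies=3):
    """Raw disk needed when every block is stored `copies` times (assumed: 3)."""
    return logical_tb * copies


def raw_storage_erasure_coded(logical_tb, data_stripes=6, parity_stripes=3):
    """Raw disk needed when each group of `data_stripes` stripes gets
    `parity_stripes` extra parity stripes (assumed: 6 + 3)."""
    return logical_tb * (data_stripes + parity_stripes) / data_stripes


logical_tb = 100  # illustrative amount of user data, in TB
replicated = raw_storage_replicated(logical_tb)
encoded = raw_storage_erasure_coded(logical_tb)
saving = 1 - encoded / replicated

print(f"replication:  {replicated} TB raw")
print(f"Reed-Solomon: {encoded} TB raw")
print(f"saving:       {saving:.0%}")
```

Under those assumed parameters, the erasure-coded layout writes half as many raw bytes as three-way replication for the same data, which is consistent with the claim that less disk space also means less IO. Actual savings depend on the configuration in use.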

However, he cautions that Hadoop users may find HDFS a better solution if they’re handling relatively small volumes of data, or if they depend on HDFS-specific features such as head node federation or hot standby.

He says the company doesn’t plan to offer a commercial version or enterprise support — it’s just contributing back to the open source community. But hey, famous last words and all that.

Quantcast isn’t the only company that has replaced HDFS. MapR’s commercial distribution of Hadoop uses a proprietary file system. DataStax Enterprise uses Apache Cassandra to replace HDFS.