Facebook Releases Data Storage Cluster Code As Open Source

Dare Obasanjo has pointed to the Cassandra Project, a new open source project from Facebook hosted on Google Code. Cassandra is a P2P clustered data storage engine developed by Facebook and heavily inspired by Google's BigTable. The code is written in Java and has been made available under the Apache 2.0 license.

As Dare says, the model used in Cassandra is straightforward and effective:

The entire system is a giant table with lots of rows. Each row is identified by a unique key. Each row has a column family, which can be thought of as the schema for the row. A column family can contain thousands of columns which are a tuple of {name, value, timestamp} and/or super columns which are a tuple of {name, column+} where column+ means one or more columns. This is very similar to the data model behind Google’s BigTable.
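To make the quoted description concrete, here is a minimal in-memory sketch of that data model in Python. It is not Cassandra's code or API: the class and method names (Column, SuperColumn, ColumnFamily, Table, insert, get) are hypothetical and chosen only for illustration. The sketch stores timestamps without acting on them, since the description above does not say how they are used.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Column:
    """A column is a tuple of {name, value, timestamp}."""
    name: str
    value: bytes
    timestamp: int


@dataclass
class SuperColumn:
    """A super column is a tuple of {name, column+}: one or more columns."""
    name: str
    columns: Dict[str, Column]

    def __post_init__(self):
        if not self.columns:
            raise ValueError("a super column needs one or more columns")


@dataclass
class ColumnFamily:
    """Holds the columns and/or super columns of one row, keyed by name."""
    name: str
    entries: Dict[str, Union[Column, SuperColumn]] = field(default_factory=dict)


class Table:
    """The 'giant table': each row is identified by a unique key."""

    def __init__(self):
        self.rows: Dict[str, ColumnFamily] = {}

    def insert(self, row_key, family_name, entry):
        # Assumption for this sketch: one column family per row.
        family = self.rows.setdefault(row_key, ColumnFamily(family_name))
        if family.name != family_name:
            raise ValueError("row already uses a different column family")
        family.entries[entry.name] = entry

    def get(self, row_key, name):
        return self.rows[row_key].entries[name]
```

A row could then be filled with `table.insert("user:1", "profile", Column("email", b"a@example.com", 1))` and read back with `table.get("user:1", "email")`. The row key, family name and values here are made up for the example.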

While Google has released a lot of open source code, its core technology stack (BigTable, MapReduce, GFS) is all proprietary and until recently was not widely discussed outside the company. Facebook has now made a number of internal projects available as open source, including its platform and API. Both Google and Facebook took advantage of open source operating systems and technologies in building out their platforms, and Facebook seems willing to give a lot more back than Google. At this rate, Facebook could soon be a top open source contributor.

Top 10 open source contributors by company (source)

Position  Company           Contribution (person-months)  Monetary value
1         Sun Microsystems  51,372                        312m euros
2         IBM               14,865                        90m euros
3         Red Hat           9,748                         59m euros
4         Silicon Graphics  7,736                         47m euros
5         SAP               7,493                         46m euros
6         MySQL             5,747                         35m euros
7         Netscape          5,249                         32m euros
8         Ximian            4,985                         30m euros
9         RealNetworks      4,412                         27m euros
10        AT&T              4,286                         26m euros