How Twitter Uses Open Source

Twitter’s Chris Aniszczyk gave a keynote address this morning at CloudOpen on how Twitter uses open source.

His talk provided insights into how open source technology can also be used in an enterprise environment for scaling infrastructure. That’s an emerging topic of interest in the enterprise world.

Aniszczyk reviewed the open source technologies Twitter depends on to manage its service:

  • MySQL, an open source relational database, is heavily used for primary storage of tweets. The company developed its own MySQL fork in the open to collaborate with the upstream community.
  • Cassandra, Hadoop, Lucene, Pig and a variety of other Apache projects are used within the Twitter infrastructure to power services such as analytics and search. The company also contributes back to these projects, and Twitter is a sponsor of the Apache Software Foundation. Cassandra is a NoSQL database. Hadoop is a framework for distributed storage and processing that includes a distributed file system; it is often used with higher-level languages like Pig, a high-level platform for big data analytics. Lucene is an open source search technology.
  • Memcached is used heavily in the company’s caching infrastructure to scale with its ever-growing traffic. Memcached helps speed up dynamic web applications by alleviating database load. The company recently open sourced Twemcache, which was heavily inspired by the Memcached code base.
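To make the caching point concrete, here is a minimal sketch of the cache-aside pattern that a Memcached-style layer enables. It is an illustration only, not Twitter's code: a plain Python dict stands in for a Memcached server, and the TweetDatabase and CachedTweetStore classes are hypothetical names invented for this example.

```python
class TweetDatabase:
    """Hypothetical stand-in for a MySQL-backed tweet table."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0  # number of times the "database" was queried

    def fetch(self, tweet_id):
        self.reads += 1
        return self.rows.get(tweet_id)


class CachedTweetStore:
    """Cache-aside reads: check the cache first, then fall back to the database."""

    def __init__(self, database):
        self.database = database
        self.cache = {}  # assumption: a dict plays the role of a Memcached server

    def get(self, tweet_id):
        key = f"tweet:{tweet_id}"
        if key in self.cache:
            return self.cache[key]  # cache hit: no database work
        text = self.database.fetch(tweet_id)  # cache miss: query the database
        if text is not None:
            self.cache[key] = text  # remember it for later requests
        return text


db = TweetDatabase({1: "hello world"})
store = CachedTweetStore(db)
for _ in range(3):
    store.get(1)
print(db.reads)  # prints 1
```

Three reads of the same tweet reach the database only once; the other two are served from the cache, which is the sense in which a cache alleviates database load.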

Twitter also develops software for its own purposes that it makes available via open source:

  • Iago is a load generator that was created to help test services before they encounter production traffic.
  • Zipkin is a distributed tracing system that the company created to help gather timing data for the services involved in managing a request to the Twitter API. In essence, it helps make Twitter faster.
  • Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. Scalding is built on top of Cascading, a framework designed for Java developers to build big data applications on top of Hadoop. Cascading is known for abstracting the complexities of MapReduce and making Hadoop clusters easier to manage. MapReduce was originally developed by Google for processing search data. Scala is a general-purpose programming language designed to express common programming patterns concisely.
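For readers unfamiliar with the MapReduce model that Scalding and Cascading abstract, the classic word count can be sketched in a few lines of plain Python. This runs in a single process and uses neither Hadoop nor the Scalding API; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


print(word_count(["open source", "Open data"]))
# prints {'open': 2, 'source': 1, 'data': 1}
```

In a real cluster the map and reduce steps run in parallel across many machines and the framework handles the grouping; libraries such as Scalding aim to let developers express the same job without writing those phases by hand.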

Facebook and Google have also open sourced their technologies, and the results are evident in the enterprise. Hadoop, for instance, was developed primarily by Yahoo! and is now a cornerstone of the big data push we are seeing across the enterprise market.

(Thanks to Jen Cloer of the Linux Foundation for sharing the summaries of what Chris planned to talk about in his keynote.)