LinkedIn open-sources its WhereHows data discovery and lineage portal

LinkedIn today open-sourced WhereHows, a metadata-centric tool the company has long used internally to make it easier for its employees to discover the data the company generates and to track the lineage of its datasets as they move across its various internal tools and services.

Now that almost every modern business creates massive amounts of data, simply managing how all this information flows across an organization can become virtually impossible. Sure, you can store it in a data warehouse, but at the end of the day, you end up with a large number of datasets that are very similar, or different versions of an original dataset, or information that has been transformed so it can be used by different tools. The exact same data also often ends up in multiple systems, just under different names or version numbers. In the end, how do you know which dataset you should work with when you are building a new product (or maybe just an executive report)?


This, LinkedIn’s Shirshanka Das and Eric Sun told me, was the problem the company was facing. So the team developed WhereHows, which functions as a central repository and web-based portal for keeping track of what happens to data in a large company like LinkedIn, or even a smaller one that has to deal with lots of heterogeneous data. At LinkedIn, WhereHows currently stores data about the status of 50,000 datasets, 14,000 comments and 35 million job executions. The company says this metadata describes data with a footprint of about 15 petabytes.

LinkedIn is a big Hadoop user, but the tool can also track data from other systems (think Oracle databases, Informatica, etc.).

WhereHows provides both an API for developers and a web interface where employees can visualize the lineage of a dataset, annotate it and more.
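
To make the idea of dataset lineage a bit more concrete, here is a minimal Python sketch of how a tool in this category might record which jobs produced which datasets and then answer the question "where did this data come from?" This is not WhereHows's actual API or data model; the `LineageGraph` class, its methods and the dataset and job names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store (hypothetical, not the WhereHows data model)."""

    def __init__(self):
        # Maps each dataset name to the set of datasets it was derived from.
        self._parents = defaultdict(set)
        # Keeps a simple log of every recorded job execution.
        self._jobs = []

    def record_job(self, job_name, inputs, outputs):
        """Record that a job read `inputs` and wrote `outputs`."""
        self._jobs.append((job_name, tuple(inputs), tuple(outputs)))
        for out in outputs:
            self._parents[out].update(inputs)

    def upstream(self, dataset):
        """Return every dataset that `dataset` transitively depends on."""
        seen = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            # .get() avoids creating empty entries for source datasets.
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Made-up example: data flows from a database into Hadoop and on to a report.
graph = LineageGraph()
graph.record_job("load_events", ["oracle.raw_events"], ["hdfs.events"])
graph.record_job("daily_rollup", ["hdfs.events"], ["hdfs.events_daily"])
graph.record_job(
    "exec_report", ["hdfs.events_daily", "hdfs.members"], ["report.weekly"]
)

for name in sorted(graph.upstream("report.weekly")):
    print(name)
```

In this sketch, asking for the upstream lineage of the hypothetical `report.weekly` dataset walks back through every recorded job until it reaches the original sources, which is roughly the kind of question a lineage portal is meant to answer at a much larger scale.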

As Das and Sun noted, LinkedIn has a long history of open-sourcing products that aren’t part of its core competency. The idea here is to encourage conversation; as the larger big-data ecosystem adopts this and similar tools, the company eventually benefits as well. Like a lot of other companies I talk to, LinkedIn also notes that open source helps it elevate its engineering brand, which in turn makes recruiting easier.