It seemed so simple. A small schema issue in a database was wrecking a feature in the app, increasing latency and degrading the user experience. The resident data engineer pops in a fix to amend the schema, and everything seems fine — for now. Unbeknownst to them, that small fix completely clobbered all the dashboards used by the company’s leadership. Finance is down, ops is pissed, and the CEO — well, they don’t even know whether the company is online.
For data engineers, it’s not just a recurring nightmare — it’s a day-to-day reality. A decade plus into that whole “data is the new oil” claptrap, and we’re still managing data piecemeal and without proper systems and controls. Data lakes have become data oceans and data warehouses have become … well, whatever the massive version of a warehouse is called (a waremansion I guess). Data engineers bridge the gap between the messy world of real life and the precise nature of code, and they need much better tools to do their jobs.
As TechCrunch’s unofficial data engineer, I’ve personally struggled with many of these same problems, and that’s what drew me to Datafold.
Datafold is a brand-new platform for managing the quality assurance of data. Much in the way that a software platform has QA and continuous integration tools to ensure that code functions as expected, Datafold integrates across data sources to ensure that a change in the schema of one table doesn’t knock out functionality somewhere else.
Founder Gleb Mezhanskiy knows these problems firsthand. He draws on his time at Lyft, where he was a data scientist and data engineer and later became a product manager “focused on the productivity of data professionals.” The idea was that as Lyft expanded, it needed much better pipelines and tooling around its data to remain competitive with Uber and others in its space.
His lessons from Lyft inform Datafold’s current focus. Mezhanskiy explained that the platform sits in the connections between all data sources and their outlets. There are two challenges to solve here. First, “data is changing, every day you get new data, and the shape of it can be very different either for business reasons or because your data sources can be broken.” And second, “the old code that is used by companies to transform this data is also changing very rapidly because companies are building new products, they are refactoring their features … a lot of errors can happen.”
In equation form: messy reality + chaos in data engineering = unhappy data end users.
With Datafold, changes that data engineers make to their extractions and transformations can be checked for unintended side effects. For instance, maybe a function that formerly returned an integer now returns a text string, a mistake accidentally introduced by the engineer. Rather than waiting until BI tools flop and a bunch of alerts come in from managers, Datafold will indicate that there is likely some sort of problem, and identify what happened.
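To make the integer-versus-string example concrete, here is a minimal sketch of that kind of check. It is not Datafold's implementation or API: the function names, the `fare_cents` column and the sample rows are all hypothetical, and it assumes the "before" and "after" outputs of a transformation fit in memory as lists of dictionaries.

```python
from typing import Any

Row = dict[str, Any]


def infer_column_types(rows: list[Row]) -> dict[str, set[str]]:
    """Collect the set of Python type names observed in each column."""
    types: dict[str, set[str]] = {}
    for row in rows:
        for column, value in row.items():
            types.setdefault(column, set()).add(type(value).__name__)
    return types


def find_type_changes(
    before: list[Row], after: list[Row]
) -> dict[str, tuple[list[str], list[str]]]:
    """Return the columns whose observed types differ between two outputs."""
    before_types = infer_column_types(before)
    after_types = infer_column_types(after)
    changes = {}
    for column in sorted(before_types.keys() | after_types.keys()):
        old = before_types.get(column, set())
        new = after_types.get(column, set())
        if old != new:
            changes[column] = (sorted(old), sorted(new))
    return changes


# Hypothetical output of the same transformation before and after a code change.
before = [{"ride_id": 1, "fare_cents": 1250}, {"ride_id": 2, "fare_cents": 980}]
after = [{"ride_id": 1, "fare_cents": "1250"}, {"ride_id": 2, "fare_cents": "980"}]

type_changes = find_type_changes(before, after)
print(type_changes)
```

Here only `fare_cents` is reported, because its values went from integers to strings while `ride_id` stayed the same. A real tool would presumably read types from the warehouse's own metadata rather than sampling values, but the comparison step is the same idea.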
The key efficiency here is that Datafold aggregates changes in datasets, even datasets with billions of entries, into summaries so that data engineers can understand even subtle flaws. The goal is that even if an error occurs in only 0.1% of cases, Datafold will identify the issue and surface a summary of it so the data engineer can respond.
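The summarizing idea can be sketched in a few lines as well. Again, this is an illustration under stated assumptions rather than how Datafold works: `summarize_diff`, the `ride_id` key and the sample data are hypothetical, both versions of the table are assumed to share a unique key column, and everything is held in memory instead of being computed inside a warehouse.

```python
from typing import Any

Row = dict[str, Any]


def summarize_diff(before: list[Row], after: list[Row], key: str) -> dict[str, Any]:
    """Aggregate row-level differences between two versions of a table."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}
    shared = before_by_key.keys() & after_by_key.keys()

    # Count, per column, how many shared rows hold a different value.
    changed_per_column: dict[str, int] = {}
    for k in shared:
        old, new = before_by_key[k], after_by_key[k]
        for column in old.keys() | new.keys():
            if old.get(column) != new.get(column):
                changed_per_column[column] = changed_per_column.get(column, 0) + 1

    return {
        "rows_removed": len(before_by_key.keys() - after_by_key.keys()),
        "rows_added": len(after_by_key.keys() - before_by_key.keys()),
        "rows_compared": len(shared),
        # Share of compared rows that changed, per column.
        "changed_share": {
            column: count / len(shared)
            for column, count in sorted(changed_per_column.items())
        },
    }


# Hypothetical table of 1,000 rows in which a code change corrupts a single fare.
before = [{"ride_id": i, "fare_cents": 1000 + i} for i in range(1000)]
after = [dict(row) for row in before]
after[417]["fare_cents"] = 0  # one bad row out of 1,000, i.e. 0.1%

summary = summarize_diff(before, after, key="ride_id")
print(summary)
```

The point of returning shares instead of the raw differing rows is that a single corrupted row in a thousand shows up as a 0.001 share on `fare_cents`, which is far easier to notice than one odd value buried in a full table dump.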
Datafold is entering a market that is, quite frankly, as chaotic as the data being processed. It sits in the key middle layer of the data stack: it’s not the data lake or data warehouse for storing data, and it isn’t an end-user BI tool like Looker, Tableau or the many others. Instead, it’s one of a number of tools available for data engineers to manage and monitor their data flows to ensure consistency and quality.
The startup is targeting companies with at least 20 people on their data team — that’s the sweet spot where a data team has enough scale and resources that they are going to be concerned with data quality.
Today Datafold is three people, and will be debuting officially at YC’s Demo Day later this month. Its ultimate dream is a world where data engineers never again have to get an overnight page to fix a data quality issue. If you’ve been there, you know precisely why such a product is valuable.