9 investors discuss hurdles, opportunities and the impact of cloud vendors in enterprise data lakes

About a decade ago, I remember having a conversation with a friend about big data. At the time, we both agreed that it was the purview of large companies like Facebook, Yahoo and Google, and not something most companies would have to worry about.

As it turned out, we were both wrong. Within a short time, everyone would be dealing with big data. In fact, it turns out that huge amounts of data are the fuel of machine learning applications, something my friend and I didn’t foresee.

Frameworks like Hadoop and Spark were already emerging, and concepts like the data warehouse were evolving. This was fine when it involved structured data like credit card info, but data warehouses weren’t designed for the unstructured data you needed to build machine learning algorithms, and the concept of the data lake developed as a way to take unprocessed data and store it until needed. It wasn’t sitting neatly on shelves in warehouses, all labeled and organized; it was more amorphous and raw.

Over time, this idea caught the attention of cloud vendors like Amazon, Microsoft and Google. What’s more, it caught the attention of investors as Snowflake and Databricks built substantial companies on the data lake concept.

Even as that was happening, startup founders began to identify other adjacent problems to attack, like moving data into the data lake, cleaning it, processing it and funneling it to applications and algorithms that could actually make use of that data. As this was happening, data science advanced outside of academia and became more mainstream inside businesses.

At that point there was a whole new modern ecosystem and when something like that happens, ideas develop, companies are built and investors come. We spoke to nine investors about the data lake idea and why they are so intrigued by it, the role of the cloud companies in this space, how an investor finds new companies in a maturing market and where the opportunities and challenges are in this lucrative area.

To learn about all of this, we queried the following investors:

Caryn Marooney, general partner, Coatue Management
Dharmesh Thakker, general partner, Battery Ventures
Casey Aylward, principal, Costanoa Ventures
Derek Zanutto, general partner, CapitalG
Navin Chaddha, managing director, Mayfield
Jon Lehr, co-founder and general partner, Work-Bench
Peter Wagner, founding partner, Wing Ventures
Nicole Priel, managing director, Ibex Ventures
Ilya Sukhar, partner, Matrix Partners

Where are the opportunities for startups in the data lakes space with players like Snowflake and the cloud infrastructure vendors so firmly established?

Caryn Marooney: The data market is very large, driven by the opportunity to unlock value through digital transformation. Both the data lake and data warehouse architectures will be important over the long term because they solve different needs.

For established companies (think big banks, large brands) with significant existing data infrastructure, moving all their data to a data warehouse can be expensive and time consuming. For these companies, the data lake can be a good solution because it enables optionality and federated queries across data sources.

Dharmesh Thakker: Databricks (which Battery has invested in) and Snowflake have certainly become household names in the data lake and warehouse markets, respectively. But technical requirements and business needs are constantly shifting in these markets — and it’s important for both companies to continue to invest aggressively to maintain a competitive edge. They will have to keep innovating to continue to succeed.

Regardless of how this plays out, we feel excited about the ecosystem that’s emerging around these players (and others) given the massive data sprawl that’s occurring across cloud and on-premise workloads, and around a variety of data-storage vendors. We think there is a significant opportunity for vendors to continue to emerge as “unification layers” between data sources and different types of end users (including data scientists, data engineers, business analysts and others) in the form of integration middleware (cloud ELT vendors); real-time streaming and analytics; data governance and management; data security; and data monitoring. These markets shouldn’t be underestimated.

Casey Aylward: There are a handful of big opportunities in the data lake space, even with many established cloud infrastructure players in the market:

Business intelligence/analytics/SQL may end up converging with machine learning/code like Scala or Python in certain products, but these domains have different end users and communities, programming language preferences and technical skills. Generally, architectural lock-ins are a big point of fear within core infrastructure. This is true for end users with their cloud providers, storage solutions, compute engines, etc. Solutions will be heterogeneous because of that and technology that enables this flexibility will be important.
As data moves around today, it is being reprocessed in each platform, which at scale is inefficient and expensive. There is an opportunity to build technology that allows users to move data around without rewriting transformations, data pipelines and stored procedures.
Finally, we’re seeing more traction around general data processing frameworks that are not MapReduce under the hood, especially in the Python data science ecosystem. This is a transition from Hadoop or even Spark, since they aren’t always best suited for unstructured, more modern algorithms.

Derek Zanutto: The rise of the data lake model is creating a large, rapidly growing market opportunity adjacent to the more established data warehouse model typified by Snowflake and the big cloud vendors. The data lake model enables enterprises to unlock insights from a broader array of data (structured and unstructured) for a broader array of use cases (historical financial reporting and predictive AI analytics).

While the data lake offers many benefits for data-driven organizations, there are certain emerging challenges that will need to be addressed for the data lake model to further accelerate in the enterprise (for example, data reliability and highly performant querying). Companies that can build solutions to address these emerging pain points will have the opportunity to capture a large share of the significant profit pool being created around the data lake model.

Navin Chaddha: The increasing ubiquity of enterprise data lakes has led to a new breed of data-first apps, which will require tooling around data privacy, integrity, governance and access management. This ubiquity also brings an increasing focus on the workflows of the data engineering and science teams, including data quality/observability and process augmentation.

Jon Lehr: DataOps is a massive opportunity for startups building in the data lakes space. Data governance, lineage, and the creation and serving of data features are key to machine learning and have not yet been taken over by companies like Snowflake or other established giants. DataOps and MLOps are needed for the successful implementation and scaling of machine learning operations.

Peter Wagner: First a note on terminology: There used to be clear definitional distinctions between things like data lakes, data warehouses, databases and the like. Now those lines have been blurred, sometimes for technical reasons as their capabilities extend and overlap, and sometimes for marketing reasons.

Within cloud data in general, we see exciting opportunities to complete the ecosystems around major platform players like Snowflake and AWS. A great example is Upsolver, a data preparation product that allows customers to use commodity cloud storage (such as S3) as a performant, functional data lake that integrates beautifully with powerful engines like Snowflake, Kafka and Presto. Upsolver can be thought of as a modern, cloud-native version of what we used to call ETL. Pepperdata is another example, bringing observability and optimization to cloud data.

Another compelling class of opportunity is driven by the rise of machine learning as a strategic workload. As ML-based production applications become more ubiquitous and mission-critical, there is an opportunity for startups to create specific tooling and infrastructure for ML workloads and their unique data types. Pinecone’s ML vector database is an example of such infrastructure. Truera is an example of ML-specific tooling, offering a model intelligence platform delivering AI explainability.

One trend we’ve taken note of is the “cloud-prem” model, as some are calling it, in which a managed cloud data service executes (at least partially) within the customer’s own VPC and cloud storage infrastructure. This model can have some strong economic and data privacy advantages. Hydrolix is an example of a company using this model to disrupt the observability data market, where cost and scaling concerns often drive customers to retain less data history than they would like. Upsolver uses the cloud-prem model for its cloud-native data preparation service, which averts needless data copies, synchronizations and movement.

Finally, we are intrigued by opportunities to create specialized data platforms that serve the needs of specific industries or enterprise functions. These companies will increasingly be built on top of horizontal platforms from Snowflake and others while adding domain-specific integrations, data and process knowledge in their products. Segment can be thought of as an early example of this trend. SetSail is a newer example, serving the data-driven sales organization.

One of the most attractive aspects of the modern cloud data platform is its business model. Snowflake taught us the power of the well-designed managed service. Ease of adoption, rapid expansion avenues and consumption-based monetization combine to make these businesses extremely valuable. We expect these attributes will be hallmarks of the very best cloud data platforms.

Nicole Priel: Players like Snowflake have done a great job introducing fully elastic storage and compute, and introducing pay-as-you-go pricing to the data lake space. There are still many challenges around how easily employees can understand the data in the warehouse and access that information. There is a tremendous amount of innovation going on right now with companies like Panoply that are tackling the last mile of data analytics, bridging the gap between business users and the data lake in a no-code, automated way.

Ilya Sukhar: I would be hesitant to go head-on against the big cloud providers, Snowflake or Databricks right now. All of these products are in their ascendant phase and competition is fierce. There is undoubtedly a next generation of products waiting to be built, perhaps with a technical breakthrough that already exists in a research group somewhere, but I worry that the adoption cycle won’t be favorable for folks getting started right now.

Without going head-on, I would point to the difficulty of storing and organizing unstructured data like images, videos and other assets that get fed into machine learning processes. There are some approaches out there on the market but I still think it’s a very fruitful area for new ideas.

All that said, I am mostly spending my time looking at companies that are building products one layer above the warehouse/lake. They take the “modern data stack” as a prerequisite and imagine what is newly possible now that all of an enterprise’s data is centralized and organized in the warehouse with the help of products like Fivetran and dbt.

In prior generations it was cost prohibitive to gather and store all of the data that exists in an enterprise. And it was complex and slow to expose it to the application layer above. But now it is simple, cheap and fast.

So I’m excited by teams thinking about the cloud data warehouse as a platform shift that enables new applications. I think there will be an inevitable turnover in existing categories like business intelligence (e.g., competing with Looker) but also many new data-enabled applications we never imagined before.

What are the biggest challenges for startups entering the data lake market right now and how do they overcome them?

Caryn Marooney: A challenge is that the market already has great companies and winners. An opportunity is for startups to integrate with these great companies. New startups have an opportunity to build around the existing platforms and help develop a broader data ecosystem and stack around the query engine.

Dharmesh Thakker: For one thing, some startups have not figured out how to partner with larger platforms like Snowflake and Databricks early enough in their go-to-market planning. We feel strongly that these established vendors, including AWS, Azure and GCP, are seeking to enable the data-infrastructure ecosystem by working with third-party platforms, versus owning the entire stack end to end. Early on in a company’s journey, it should work closely with the relevant counter-party product and go-to-market teams to find opportunities to integrate with the platforms and collaborate on sales so it is actively thought of as a complementary product offering.

While it is important for companies to have self-sufficient business models, we’ve seen traps where larger platforms will vouch for (and co-sell with) competitive products with their customers simply because of their knowledge of certain vendors versus others. Investing in these relationships early and often can be highly beneficial and synergistic.

Casey Aylward: One of the biggest challenges startups face is that there are a lot of players in the market pushing standards that try to lock you into solutions that work better with their compute engines. In cloud storage and computing, storage is inexpensive and data processing has become the real battleground. While cloud data warehouses dominate at the SQL prompt layer, this is beginning to move to the data lake with companies like Databricks and Starburst. To succeed in the data lake space, I think the best startups will serve specific use cases around data processing to position against these more general purpose platforms.

Derek Zanutto: Startups must help enterprises navigate what is essentially a market creation story by convincing buyers, who have spent the last 40 years procuring data warehouse technologies, to rethink their approach to data storage and analytics. Furthermore, the data lake category is intensely competitive: Startups will often compete against well-entrenched data platforms that often own the underlying object storage infrastructure.

To succeed in the data lake market, startups will need to focus on product depth, not breadth. Companies that focus on building the “best of breed” technology for a core use case will have a better shot at scaling. From a go-to-market perspective, successful startups often sell solutions that specifically address tangible, real-world pain points (as opposed to broad technology solutions). Successful companies aim to secure small, paid pilots with real business impact and leverage those successful implementations as opportunities for future expansion.

Finally, open sourcing the underlying technology can be a highly successful way to gain both broad, bottoms-up distribution as well as mind share. Open sourcing also gives startups another opportunity for differentiation: It enables them to sell a multi and hybrid cloud story. This appeals to chief data officers who are increasingly looking to standardize onto open formats to give them the flexibility to leave the “walled gardens” of one cloud data platform for another.

Navin Chaddha: Three common pitfalls for data startups today:

Due to the lack of an open standard, startups need to integrate their product across several different data lake solutions.
Startups need to avoid becoming services companies, and thus focus on strong product differentiation.
It’s very important for startups to articulate their differentiation from incumbent platform vendors and ROI for potential customers.

Jon Lehr: Differentiation. Startups entering the data lake market right now need to answer: Are we faster? Are we cheaper? Do we offer more value? Do we do something that no other incumbent does (e.g., graph data management)? It’s interesting to look back and remember that Databricks differentiated themselves when they were starting out by “just” running Spark cheaper, faster and better on AWS than EMR.

Peter Wagner: Clutter is one of the major challenges for new entrants. There are many similar-sounding claims competing for mindshare from data teams. Elegant product definition and design that enables true product-led growth is the key to punching through the clutter.

It is also critically important to achieve the right alignment and leverage with the major cloud data players. Startups need to identify which platforms will be their go-to partners, and craft the right fit in terms of product and go-to-market strategies with these partners.

Nicole Priel: Some of the biggest challenges around the data lake continue to be data reliability and query performance.

What impact do the big cloud vendors have on the data lake market with their offerings?

Caryn Marooney: The big cloud vendors are the elephant in the room in any software infrastructure conversation. They will always offer competitive products and have extensive resources to deploy behind them. But we have seen time and time again that there is opportunity for innovative startups to succeed in the market by delivering innovative software and exceptional customer experiences.

Dharmesh Thakker: They have a big impact, but customer reaction can be varied. The large cloud vendors want to be the single source of truth for all storage and compute needs across the data landscape, so understanding their value proposition and product direction is important to any company starting its journey in the space. Ultimately, AWS, Azure and GCP have the resources to bring pricing down significantly for customers; however, there are two main drawbacks to customers working with these major players.

Given the significant data sprawl across cloud and on-premise workloads, customers will often seek a platform-agnostic approach that works across cloud vendors versus being tightly coupled within one. Similarly, tying customers’ toolbox to one of these cloud vendors will eventually lead to vendor lock-in, whereby the cloud providers will have significant pricing power and control over infrastructure needs.

For smaller customers who are looking for a simple approach, an end-to-end stack from a cloud vendor may prove to be worthwhile; however, as customers mature they often will look for solutions that work across platforms.

Casey Aylward: Cloud vendors make money with their data lake storage solutions (Azure’s Data Lake Storage and Amazon’s S3). Since they own the very bottom of the stack, it’s a logical place to layer in additional value with data management tools and data processing (SQL or code based). Because big cloud vendors have the benefit of owning and monetizing storage, they have the power to underprice the market on other features, and they have immediate distribution.

In the open-source community, there are common data formats emerging like Apache Arrow that will significantly reduce data transfer costs and make interoperability across platforms and languages possible. This is one enabling technology that will help avoid lock-in with one platform.

Derek Zanutto: The big cloud vendors are developing and selling end-to-end ecosystems around their data lake offerings. Today they primarily offer the underlying object storage infrastructure on top of which data lakes are built, as well as data integration tools to move data into the data lake. They also offer a variety of data services, such as data science notebooks and federated SQL query engines, in order to make data stored in the data lake accessible for data consumers. Through their comprehensive platform approach, cloud players offer “one-stop shops” for all the data services enterprises need, often at compelling price points, since they’re able to bundle data services with core infrastructure spend.

Additionally, their services integrate seamlessly with other technologies in their cloud portfolios, providing what can be a highly compelling value proposition to many enterprises. As a result, the cloud vendors have had a tremendous impact on the data lake market. In order to compete with the cloud players, data lake companies must drive real product innovation and differentiation. In some categories, this could take the form of a verticalized solution that better addresses the pain points of particular industries and buyers.

Navin Chaddha: The big cloud vendors have in many ways already commoditized the infrastructure market for data lakes; they also continue to provide value up the stack by integrating deeply with applications along with adding a layer of business intelligence and visualization tools.

Jon Lehr: Big cloud vendors’ lower cost and credibility are helping shift the standard from on-premise to cloud. Right now, one of the biggest challenges for the data lakes market is actually getting data into the cloud. However, as a consequence of this shift, once data is in the cloud it opens up the opportunity for best-in-class solutions.

Peter Wagner: The big cloud vendors are a double-edged sword. They are enablers of the fundamental opportunity in cloud data in a very real way. They also constitute formidable direct competition. Snowflake has navigated this duality to build an amazing business that is both a customer and competitor of the major cloud service providers.

The major cloud data players also create opportunity for ecosystem development. Their very success has exposed unfilled gaps that startups are rushing to fill. Take a look at the classical enterprise data market. All those niches (and some new ones!) will be filled by cloud-native companies with modern architectures and more attractive business models. Companies that fill the right ecosystem needs for the right partners can experience tremendous go-to-market lift.

Nicole Priel: For data lake storage, most companies starting out or looking to migrate are considering Snowflake and Google BigQuery. It’s important that any startups in the data ecosystem are compatible with these players.

Ilya Sukhar: They all rightly see the data warehouse (or data lake, whatever term you prefer) as extremely strategic. So they’re doing a nice job competing on price and also bundling the warehouse as part of broader deals with big customers.

Beyond data lakes, there are lots of adjacent services around data governance, preparation, management and getting data in and out of the data lake. What kind of startup opportunity do you see in these adjacent markets?

Caryn Marooney: The data prep, management and governance opportunity is in its earliest innings. As the broader data ecosystem matures, these solutions will become more and more widely adopted. Especially for larger enterprises, these solutions become very important at scale.

Dharmesh Thakker: As mentioned previously, there is a significant ecosystem that has already emerged around data lakes and data warehouses in the form of “unification layers,” and we think it’s a very big opportunity for startups. We have seen success from cloud ELT (extract, load, transform) companies such as Matillion, Fivetran, StreamSets and dbt, which have built solutions for replicating data from cloud vendors like Salesforce, Workday, NetSuite, etc. and then transforming the data for analytics once it resides in its relevant lake or warehouse.

Other companies, including Confluent and Rockset, are focused on real-time, streaming analytics. In addition, once data is loaded into its relevant destination, data catalog, data lineage, data governance, data security and data monitoring have become priorities for all organizations in order to extract business value from the data silos.
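For readers unfamiliar with the ELT pattern mentioned above, the following is a minimal sketch of the idea: raw data is loaded into the warehouse untouched and only then transformed with SQL. It assumes Python's built-in sqlite3 module as a stand-in for a cloud warehouse; the sample records, table names and the run_elt helper are hypothetical and do not reflect any vendor's actual product or API.

```python
import sqlite3

# Hypothetical raw records, standing in for an export from a SaaS source.
RAW_RECORDS = [
    {"account": "acme", "amount": "120.50", "status": "won"},
    {"account": "acme", "amount": "79.50", "status": "won"},
    {"account": "globex", "amount": "300.00", "status": "lost"},
]


def run_elt(records):
    """Load raw rows first, then transform them inside the 'warehouse'."""
    # In-memory SQLite is an assumption used purely for illustration.
    conn = sqlite3.connect(":memory:")

    # Load: copy the extracted rows as-is; amounts stay untyped text.
    conn.execute(
        "CREATE TABLE raw_deals (account TEXT, amount TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_deals (account, amount, status) VALUES (?, ?, ?)",
        [(r["account"], r["amount"], r["status"]) for r in records],
    )

    # Transform: cast and aggregate with SQL after the data has landed.
    conn.execute(
        """
        CREATE TABLE won_revenue AS
        SELECT account, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_deals
        WHERE status = 'won'
        GROUP BY account
        """
    )
    rows = conn.execute(
        "SELECT account, revenue FROM won_revenue ORDER BY account"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_elt(RAW_RECORDS))
```

The point of the ordering is that the raw table is kept as loaded, so new transformations can be written later without going back to the source system.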

Casey Aylward: Further up the stack, there is an opportunity to create a unified layer at the top that brings all of these data repositories together so users across the organization can access the data, including nontechnical users. We believe collaborative notebooks will be a good solution since they are language agnostic.

In addition, data reliability will be critical for operational business use cases like artificial intelligence and machine learning. This has an effect on customer-facing applications and critical decisions, so there is a big opportunity in making sure the data is monitored the same way other core parts of infrastructure or applications are monitored.

Derek Zanutto: There is tremendous opportunity for startups solving data governance and management challenges, whether or not they deal with data lakes. We’ve found through extensive conversations with chief data officers that their largest pain points center around data quality, data governance and time-to-insight.

In the area of data quality, one in three business users and consumers of data surveyed didn’t trust the data their teams were using. This is due to widespread inconsistency in data usage and terminology across teams, as well as a lack of a single source of truth. The lack of standardization creates inconsistent and inaccurate data models. These in turn lead to inconsistent and inaccurate model outputs that lead to imperfect business decisions that ultimately hurt the bottom line. Startups that help enterprises solve data quality challenges will be uniquely positioned to capture both interest and budget from investors and potential customers alike.

With regard to data governance, regulatory pressures from increasingly stringent legislation (e.g., GDPR and CCPA) have made the protection and privacy of consumer data a top priority within most global enterprises. Despite the prioritization, 80% of the chief data officers we surveyed do not consider themselves to be sufficiently GDPR compliant today. In fact, many enterprises lack the systems necessary to answer even the most basic questions of their data: what data they have, where it came from, where it has been, what it is being used for and how it may be impacted in the event of a data breach. Startups that help enterprises better understand their data history and potential impact are becoming increasingly mission-critical in most corporate board rooms.

Finally, with regard to time-to-insight, most enterprises are experiencing a massive proliferation of data coupled with significant fragmentation of that data across silos. These challenges make it difficult for consumers of data such as data analysts and data scientists to fully utilize the data they have and to drive it toward actionable insights. In fact, the business analysts we have surveyed spend on average 50% of their time simply looking for the right data for their analyses. We believe there’s a tremendous market opportunity for startups that will significantly reduce time-to-insight for enterprises; the winning startups will achieve this by democratizing self-serve access to data for all users, augmenting existing business intelligence workflows using no-code and NLP, and automatically surfacing insights to business users.

Navin Chaddha: Here are some opportunities for startups in adjacent markets:

Governance: privacy consent, access control, compliance.
Preparation: data quality, sync/integration, stream processing.
Management: data ops, exploratory analysis, workflow automation.

Jon Lehr: We’re already seeing traction in a lot of these adjacent markets as forward-thinking organizations have been building upon their internal data management practice for years now. Algorithmia, an MLOps management platform and Work-Bench portfolio company, is a great example of tech capitalizing on the demand for data solutions: Its software manages all stages of the ML lifecycle within existing operational processes, making sure models are put into production quickly, securely and cost-effectively.

While DataOps isn’t necessarily new, it’s often thought of in the wrong way. Its abilities go beyond DevOps for data: DevOps is only part of the bigger data analytics picture, whereas DataOps completes the picture by automating and reducing the full end-to-end cycle time of data analytics.

Peter Wagner: Adjacent ecosystem opportunities are where a lot of exciting new companies are being built. We are particularly excited by data preparation (e.g., Upsolver) and data stack observability (e.g., Pepperdata), as well as optimized infrastructure for managing ML (e.g., Pinecone) and observability (e.g., Hydrolix) workloads. Technologies that enable data privacy and sharing also offer important opportunities where some major companies might be built.

Nicole Priel: I believe that the next challenge is addressing the last-mile problem. There is still a gap between the final end users of the data and the gatekeepers who truly understand what’s in the data lake. Simplifying discoverability and the use of ML to auto-generate insights will help close that gap.

Ilya Sukhar: I continue to be very excited about the impact Fivetran is having in this ecosystem as the primary way customers move data from their SaaS products or internal databases into the warehouse. And there’s a lot more value to deliver on top in terms of making that data better organized and more accessible. Going further, I’m particularly excited about data-enabled applications that can be built on top of “Powered by Fivetran,” which is a Plaid-like API platform for getting programmatic access to all of a company’s data.

Looking outside of Fivetran, ever since the Snowflake IPO there’s been a frenzy of rich financing rounds for talented teams going after well-known problems in data — governance, quality, monitoring, cataloguing, integration, etc. All good ideas but each category has at least a few well-financed upstarts. So it’s going to be a bit crowded out there for a while.