The market for synthetic data is bigger than you think

“By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.” This is a prediction from Gartner that you will find in almost every single article, deck or press release related to synthetic data.

We are repeating this quote here despite its ubiquity because it says a lot about the total addressable market of synthetic data.

Let’s unpack it. First, describing synthetic data as “synthetically generated” may seem tautological, but it is also quite clear: We are talking about data that is artificial and created, rather than gathered in the real world.

Next, there’s the core of the prediction: that synthetic data will be used in the development of most AI and analytics projects. Since such projects are on the rise, the corollary is that the market for synthetic data is also set to grow.

Last but not least is the time horizon. In our startup world, 2024 is almost today, and people at Gartner already have a longer-term prediction: Some of its team members published a piece of research titled “Forget About Your Real Data — Synthetic Data Is the Future of AI.”

“The future of AI” is the kind of promise that investors like to hear, so it’s no surprise that checks have been flowing into synthetic data startups.

In 2022 alone, MOSTLY AI raised a $25 million Series B round led by Molten Ventures; Datagen landed a $50 million Series B led by Scale Venture Partners; and Synthesis AI pocketed a $17 million Series A.

Synthetic data startups that have raised significant amounts of funding already serve a wide range of sectors, from banking and healthcare to transportation and retail. But they expect use cases to keep expanding, both in new sectors and in those where synthetic data is already common.

To understand what’s happening, but also what’s coming if synthetic data does get more broadly adopted, we talked to various CEOs and VCs over the last few months. We learned about the two main categories of synthetic data companies, which sectors they address, how to size the market and more.

The tip of the iceberg

Quiet Capital’s founding partner, Astasia Myers, is one of the investors bullish about synthetic data and its applications. She declined to disclose whether she invested in this space, but said that “there’s a lot to be excited about in the synthetic data world.”

Why the enthusiasm? “Because it gives teams faster access to data in a secure way at a lower cost,” she told TechCrunch.

Access to large troves of data has become critical for machine learning teams, and real data is often not up to the task, for different reasons. This is the gap that synthetic data startups are hoping to fill.

These startups focus on two main categories: structured data and unstructured data. The former refers to the kind of datasets that sit in tables and spreadsheets, while the latter covers what we could call media files, such as audio, text and visual data.

“It makes sense to distinguish between structured and unstructured synthetic data companies,” Myers said, “because the synthetic data type is applied to different use cases and therefore different buyers.”

According to MOSTLY AI CEO Tobias Hann, most of the demand for structured synthetic data comes from banking, insurance, telecommunications and healthcare companies.

These four highly regulated sectors are attracted by the possibility of plentiful — yet privacy-preserving — data. Whether synthetic data can deliver this or not is still somewhat controversial, but several companies think so, as do their investors.

Recently funded companies in this structured data vertical include MOSTLY AI; Tonic.ai, which raised a $35 million Series B funding round; and Gretel AI, which closed a $50 million Series B round last October. A fuller list and market map can be found in this Medium post by synthetic data advocate Elise Devaux, whose employer Statice is also a competitor.

As captured by Devaux, the unstructured synthetic data side of the market is represented by a whole other range of companies, such as the above-mentioned Datagen and Synthesis AI. Some, such as Parallel Domain, appeared a few years ago, while others, such as Scale AI, entered the space more recently. But they have one thing in common: They have no need to envy their structured data peers when it comes to attracting funding or clients.

Visual synthetic data, for instance, has a variety of use cases. According to Datagen CEO Ofir Zuk (Chakon), four of these are accelerating faster than others: AR/VR/metaverse, in-cabin automotive and automotive in general, smart conferencing and home security.

However, Datagen is also making sure its algorithms and technology are domain agnostic to ensure it will be ready when synthetic data usage takes off in other sectors, such as retail and robotics. And there’s little reason to doubt that it will be the case.

An ongoing democratization

“Deep learning with synthetic data will democratize the tech industry,” general partner at LDV Capital Evan Nisselson predicted in a 2018 TechCrunch guest column.

By democratization, Nisselson meant leveling the playing field. By using synthetic data, startups would be able to do applied machine learning without the type of big data that only large tech companies had at that point.

Nisselson’s prediction both held up and didn’t. Synthetic data helped underdogs in their David-Goliath fight. But now, Meta and the like want to have their cake and eat it, too. Nisselson acknowledges as much.

In 2018, he recalled, “a lot of the people at big companies said: ‘Evan, we have more data than we need, we don’t need to make synthetic data.’” The only exception was for addressing edge cases in applications such as autonomous vehicles.

“But many things have changed, and I think more and more [big companies] will leverage synthetic media,” Nisselson said.

Synthetic data has selling points that can appeal to companies of all sizes. “It’s a faster and sometimes cheaper way to train systems,” Nisselson said. This is in part because the data is itself generated with AI, for instance via generative adversarial networks (GANs). But it is also because costly steps such as annotation and labeling could potentially be skipped.
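To make the labeling point concrete, here is a minimal sketch in Python. It is not how any company mentioned here generates data: the class names, the two-dimensional "features" and the `make_labeled_samples` helper are all assumptions made up for illustration, and a simple random sampler stands in for a real generative model such as a GAN. The only point it illustrates is that when data is created rather than gathered, each sample's label is known at the moment of creation, so no separate annotation pass is needed.

```python
import random


def make_labeled_samples(n, seed=0):
    """Generate n synthetic (features, label) pairs.

    Hypothetical stand-in for a real generator: the label is chosen
    first, then the features are sampled around that class's center,
    so every sample is labeled by construction.
    """
    rng = random.Random(seed)
    # Assumed, illustrative classes; each is a 2D Gaussian blob.
    class_centers = {"pedestrian": (0.0, 0.0), "vehicle": (5.0, 5.0)}
    labels = sorted(class_centers)

    samples = []
    for _ in range(n):
        label = rng.choice(labels)
        cx, cy = class_centers[label]
        features = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        samples.append((features, label))
    return samples


dataset = make_labeled_samples(1000)
```

With real-world data, the equivalent of the `label` field usually has to be added by human annotators after collection, which is one of the costly steps Nisselson is referring to.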

If one buys into the bullish hypothesis on synthetic data becoming the main form of training data, it becomes easy to calculate its total addressable market (TAM), Zuk said. “We can simply say that the TAM of synthetic data and the TAM of data will converge.”

This market could even be bigger than expected if new use cases are unlocked within companies. “We are just starting to see the early innings of synthetic data’s role in organizations,” Myers said.

It is not clear yet how this democratization within organizations will happen, but a low-code or no-code approach will likely help. MOSTLY AI’s strategy, for example, is very much focused on unlocking business value for clients that include Fortune 100 banks and insurers, as well as telcos. Because of this, the startup created a platform that can be used not only by data scientists, but also by software engineers and quality testers.

What’s next?

Some applications of synthetic data are still novel for the general public, but already too late for a VC firm like LDV, whose thesis is to invest as early as possible in teams leveraging cutting-edge technology.

This makes Nisselson interesting to talk to, as long as one keeps that bias in mind. For instance, he describes synthetic data applied to transportation or to robotics (such as training robots to pick up items in warehouses or factory lines) as “crowded” already.

Use cases highlighted by Nisselson also included training autonomous retail, like AiFi does; marketing, advertising, and e-learning, like his portfolio company Synthesia, which uses synthetic avatars for corporate training; gaming, including the metaverse; and what he calls “pure content creation.”

Myers too highlighted the use of synthetic data in content creation. “Synthetic data is most affiliated with deepfakes that replace one person’s likeness or voice with synthetic versions,” she said, but “these technologies can also be applied to commercial applications.”

One thing is for sure: With startups blossoming and use cases aplenty, some of these companies will eventually buy each other or get acquired by tech giants. Last October, even before rebranding as Meta, Facebook had already quietly acquired AI.Reverie. With other large companies partnering with synthetic data startups more or less openly, and considering the current venture capital climate, we would be surprised if some of these partnerships didn’t lead to M&A deals in the near future. It’s something that we will definitely be tracking.
