For companies that use ML, labeled data is the key differentiator

Sylvain Kalache

Contributor

Sylvain Kalache is the co-founder of Holberton, an edtech company training digital talent in more than 10 countries. An entrepreneur and software engineer, he has worked in the tech industry for more than a decade. Part of the team that led SlideShare to be acquired by LinkedIn, he has written for CIO and VentureBeat.

AI is driving a paradigm shift: the software industry is moving from writing logical statements to data-centric programming. Data is now oxygen. The more training data a company gathers, the brighter its AI-powered products will burn.

Why is Tesla so far ahead with advanced driver assistance systems (ADAS)? Because no one else has collected as much information — it has data on more than 10 billion driven miles, helping it pull ahead of competition like Waymo, which has only about 20 million miles. But any company that is considering using machine learning (ML) cannot overlook one technical choice: supervised or unsupervised learning.

There is a fundamental difference between the two. For unsupervised learning, the process is fairly straightforward: The acquired data is fed directly to the models, and if all goes well, they will identify patterns.

Elon Musk compares unsupervised learning to the human brain, which gets raw data from the six senses and makes sense of it. He recently shared that making unsupervised learning work for ADAS is a major challenge that hasn’t been solved yet.

Supervised learning is currently the most practical approach for most ML challenges. O’Reilly’s 2021 report on AI Adoption in the Enterprise found that 82% of surveyed companies use supervised learning, while only 58% use unsupervised learning. Gartner predicts that through 2022, supervised learning will remain favored by enterprises, arguing that “most of the current economic value gained from ML is based on supervised learning use cases.”

Supervised learning requires the crucial additional step of making raw data smart by labeling it. If we take the example of Tesla’s ADAS, a human has looked at and labeled pretty much every object in every image in all that training data to identify people, traffic signs, other vehicles, etc.

“Raw data, while plentiful and in theory, useful, cannot typically be used by an ML system without modification and preparation,” writes Peter Levine, a partner at venture firm Andreessen Horowitz. “Before being fed into an ML framework like PyTorch or Tensorflow, data has to be aggregated, transformed, cleaned, augmented, and — in most cases — labeled.”

It turns out that data labeling can take up to 80% of the resources in the average ML project. It’s also a big source of failure: 70% of companies report having problems labeling their data. To date, data labeling has been a brute-force affair: the more Mechanical Turk workers or the larger the annotation farms a company throws at the problem, the faster it gets done. Cost and iteration speed grow linearly with the number of workers the company can hire. In other words, the approach does not scale well.

However, AI itself has a solution to this problem: Leveraging ML to pre-label the raw data so workers only have to confirm what the computer has done. Human labelers can then focus on edge cases, making the process faster and cheaper.
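The pre-labeling workflow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual pipeline: the `PreLabel` record, the `route_for_review` helper, the 0.9 confidence threshold and the sample data are all assumptions made for the example. The idea is that a model's confident guesses go to a quick-confirm queue, while low-confidence edge cases go to human labelers for full review.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    """A model-generated label awaiting human confirmation (hypothetical record)."""
    item_id: str
    label: str
    confidence: float  # model score in [0, 1]


def route_for_review(pre_labels, threshold=0.9):
    """Split pre-labels into a quick-confirm queue and a full-review queue.

    The threshold is an illustrative assumption; a real project would tune it.
    """
    confirm, review = [], []
    for p in pre_labels:
        if p.confidence >= threshold:
            confirm.append(p)   # worker only needs to confirm the machine's label
        else:
            review.append(p)    # edge case: worker labels it carefully
    return confirm, review


# Made-up sample data for illustration only.
pre_labels = [
    PreLabel("img_001", "pedestrian", 0.98),
    PreLabel("img_002", "traffic_sign", 0.95),
    PreLabel("img_003", "vehicle", 0.55),
]

confirm_queue, review_queue = route_for_review(pre_labels)
print("confirm:", [p.item_id for p in confirm_queue])
print("review:", [p.item_id for p in review_queue])
```

The more items that land in the confirm queue, the less human time each label costs, which is where the speed and cost gains would come from.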

It’s been more than five years since computers started to beat humans at image recognition, but the industry has only recently started booming. The data annotation market, which was only valued at $695.5 million in 2019, is expected to surpass $6 billion by 2027.

One of the main players in the space is Scale AI. Its recent $325 million round of funding brought the company to a whopping $7 billion valuation. In its fundraising announcement, the company said it was able to improve Toyota’s annotation throughput by 10 times in a matter of weeks. Toyota AI Ventures senior partner Chris Abshire defined the ability to “easily obtain data, and then extract value from that data with minimal human intervention” as the holy grail for many AI startups.

Data annotation also applies to more traditional industries. Blue River Technology, John Deere’s AI subsidiary, is using supervised learning to improve John Deere smart sprayers’ ability to tell the difference between a weed and a crop. Using the Labelbox platform, one of Scale AI’s main competitors, Blue River Technology was able to cut its labeling time nearly in half, speeding up iteration while also saving money. “Over the course of 2020, we were able to lower our cost per label by 25%,” says Emma Bassein, Blue River’s director of data and machine learning.

Scale and Labelbox, the largest players in the field, represent different approaches to the labeling problem. Scale is an example of approaching it from a service perspective, where it takes data from its customers and returns it labeled, relieving companies of the task altogether. This approach is popular among enterprises that require large-scale training data sets — primarily self-driving car companies.

Labelbox is an example of a platform perspective that gives data owners the tools to annotate their data without giving up control. The platform approach is more popular among companies that depend primarily on quality rather than quantity in their training data.

Data quality ranks as the second-biggest challenge for companies doing AI, and labeling data is a way to assess it. Data quality encompasses a number of elements, including volume, diversity, accuracy and bias. For instance, with ADAS technology, if there aren’t enough images of rainy conditions, the model won’t work well in a storm.

A good training data platform can identify and fix this problem before the model goes into production and a car crashes in the rain. The labeling process can also identify biases in the data, which would otherwise train your model to be racist or sexist — Amazon’s recruiting model discriminating against women is one such disastrous failure caused by poor data quality.
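The rainy-conditions example can be made concrete with a simple coverage check over labels. This is a rough sketch under stated assumptions: the `underrepresented` helper, the 10% minimum share and the condition counts are invented for illustration and do not describe how any particular training data platform works.

```python
from collections import Counter


def underrepresented(conditions, min_share=0.10):
    """Return the labels whose share of the data set falls below min_share.

    min_share is an illustrative assumption, not an industry standard.
    """
    counts = Counter(conditions)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(label for label, n in counts.items() if n / total < min_share)


# Made-up weather labels attached to 100 training images.
weather_labels = ["clear"] * 80 + ["night"] * 15 + ["rain"] * 5

flagged = underrepresented(weather_labels)
print(flagged)  # conditions that need more training images
```

With this sample data, rain makes up 5% of the images and is flagged, signaling that more rainy images should be collected and labeled before the model ships.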

When a company chooses supervised learning, it needs to have a strategy that allows it to label data as quickly as it acquires it. It has been essential for software companies to hire top software talent to write the best lines of code, but the new paradigm will be to generate the smartest data to come up with the best AI models.
