
Image-generating AI can copy and paste from training data, raising IP concerns

A new study shows that Stable Diffusion and similar models replicate training data


Stable Diffusion
Image Credits: Bryce Durbin / TechCrunch

Image-generating AI models like DALL-E 2 and Stable Diffusion can — and do — replicate aspects of images from their training data, researchers show in a new study, raising concerns as these services enter wide commercial use.

Co-authored by scientists at the University of Maryland and New York University, the research identifies cases where image-generating models, including Stable Diffusion, “copy” from the public internet data — including copyrighted images — on which they were trained.

To be clear, the study hasn’t been peer reviewed yet. A researcher in the field, who asked not to be identified by name, shared high-level thoughts with TechCrunch via email.

“Even though diffusion models such as Stable Diffusion produce beautiful images, and often ones that appear highly original and custom tailored to a particular text prompt, we show that these images may actually be copied from their training data, either wholesale or by copying only parts of training images,” the researcher said. “Companies generating data with diffusion models may need to reconsider wherever intellectual property laws are concerned. It is virtually impossible to verify that any particular image generated by Stable Diffusion is novel and not stolen from the training set.”

Images from noise

State-of-the-art image-generating systems like Stable Diffusion are what’s known as “diffusion” models. Diffusion models learn to create images from text prompts (e.g., “a sketch of a bird perched on a windowsill”) as they work their way through massive training data sets. The models — trained to “re-create” images as opposed to drawing them from scratch — start with pure noise and refine an image over time to make it incrementally closer to the text prompt.
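For intuition, here is a minimal, purely illustrative sketch of that refinement loop in Python. The denoiser below is a stand-in function rather than a trained network, and the prompt "embedding" is a placeholder array; a production system like Stable Diffusion uses a learned neural denoiser conditioned on a text encoder's output.

```python
import numpy as np

def toy_denoise_step(noisy_image, prompt_embedding, t, num_steps):
    """Hypothetical denoiser: a real model is a trained neural network that
    predicts the noise to remove, conditioned on the prompt and timestep t
    (t is unused in this toy stand-in)."""
    # Nudge the image slightly toward the prompt "embedding" as a stand-in
    # for predicting and subtracting noise.
    predicted_noise = noisy_image - prompt_embedding
    return noisy_image - (1.0 / num_steps) * predicted_noise

def generate(prompt_embedding, num_steps=50, size=(64, 64, 3), seed=0):
    rng = np.random.default_rng(seed)
    image = rng.normal(size=size)          # start from pure noise
    for t in reversed(range(num_steps)):   # refine the image step by step
        image = toy_denoise_step(image, prompt_embedding, t, num_steps)
    return image

# Usage: a real system would embed the text prompt with a text encoder;
# here a zero array stands in for that embedding.
fake_prompt_embedding = np.zeros((64, 64, 3))
img = generate(fake_prompt_embedding)
```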

It’s not very intuitive tech. But it’s exceptionally good at generating artwork in virtually any style, including photorealistic art. Indeed, diffusion has enabled a host of attention-grabbing applications, from synthetic avatars in Lensa to art tools in Canva. DeviantArt recently released a Stable Diffusion–powered app for creating custom artwork, while Microsoft is tapping DALL-E 2 to power a generative art feature coming to Microsoft Edge.

Stable Diffusion copying
On the top are images generated by Stable Diffusion from random captions in the model’s training set. On the bottom are images that the researchers prompted to match the originals. Image Credits: Somepalli et al.

To be clear, it was never a mystery that diffusion models can replicate elements of training images, which are usually scraped indiscriminately from the web. Artists like Hollie Mengert and Greg Rutkowski, whose classical painting styles and fantasy landscapes are among the most commonly used prompts in Stable Diffusion, have decried what they see as poor AI imitations that are nevertheless tied to their names.

But it has been difficult to measure empirically how often this copying occurs, given that diffusion systems are trained on billions of images drawn from a range of different sources.

To study Stable Diffusion, the researchers randomly sampled 9,000 images, along with their corresponding captions, from a data set called LAION-Aesthetics — one of the image sets used to train Stable Diffusion. LAION-Aesthetics contains images paired with text captions, including images of copyrighted characters (e.g., Luke Skywalker and Batman), images from IP-protected sources such as iStock, and art from living artists such as Phil Koch and Steve Henderson.

The researchers fed the captions to Stable Diffusion to have the system create new images. They then wrote new captions for each, attempting to have Stable Diffusion replicate the synthetic images. After comparing the two sets of generated images — the set created from the LAION-Aesthetics captions and the set from the researchers’ prompts — with an automated similarity-spotting tool, the researchers say they found a “significant amount of copying” by Stable Diffusion across the results, including backgrounds and objects recycled from the training set.
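The article doesn't name the similarity tool the researchers used, but the general recipe for this kind of check is to embed each image as a feature vector and flag generated/training pairs whose similarity exceeds a threshold. The sketch below illustrates that idea with random stand-in features and an arbitrary cutoff; a real pipeline would use embeddings from a copy-detection or image-retrieval model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def flag_copies(generated_feats, training_feats, threshold=0.9):
    """For each generated image, find its closest training image and flag the
    pair if the similarity exceeds the (illustrative) threshold."""
    flagged = []
    for i, g in enumerate(generated_feats):
        sims = [cosine_similarity(g, t) for t in training_feats]
        j = int(np.argmax(sims))
        if sims[j] > threshold:
            flagged.append((i, j, sims[j]))
    return flagged

# Usage with random stand-in features; real embeddings would come from a
# model trained for copy detection or image retrieval.
rng = np.random.default_rng(0)
gen = rng.normal(size=(10, 512))
train = rng.normal(size=(100, 512))
print(flag_copies(gen, train))
```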

One prompt — “Canvas Wall Art Print” — consistently yielded images showing a particular sofa, a comparatively mundane example of the way diffusion models associate semantic concepts with images. Others containing the words “painting” and “wave” generated images with waves resembling those in the painting “The Great Wave off Kanagawa” by Katsushika Hokusai.

Across all their experiments, Stable Diffusion “copied” from the training data set roughly 1.88% of the time, the researchers say. That might not sound like much, but considering the reach of diffusion systems today — Stable Diffusion had created over 170 million images as of October, according to one ballpark estimate — it’s tough to ignore: at that rate, copied elements would show up in roughly 3.2 million of those images.

“Artists and content creators should absolutely be alarmed that others may be profiting off their content without consent,” the researcher said.

Implications

In the study, the co-authors note that none of the Stable Diffusion generations matched their respective LAION-Aesthetics source image and that not all models they tested were equally prone to copying. How often a model copied depended on several factors, including the size of the training data set; smaller sets tended to lead to more copying than larger sets.

One system the researchers probed, a diffusion model trained on the open source ImageNet data set, showed “no significant copying in any of the generations,” they wrote.

The co-authors also advised against excessive extrapolation from the study’s findings. Constrained by the cost of compute, they were only able to sample a small portion of Stable Diffusion’s full training set in their experiments.

Stable Diffusion copying
More examples of Stable Diffusion copying elements from its training data set. Image Credits: Somepalli et al.

Still, they say that the results should prompt companies to reconsider the process of assembling data sets and training models on them. Vendors behind systems such as Stable Diffusion have long claimed that fair use — the doctrine in U.S. law that permits the use of copyrighted material without first having to obtain permission from the rightsholder — protects them in the event that their models were trained on copyrighted content. But it’s an untested theory.

“Right now, the data is curated blindly, and the data sets are so large that human screening is infeasible,” the researcher said. “Diffusion models are amazing and powerful, and have showcased such impressive results that we cannot jettison them, but we should think about how to keep their performance without compromising privacy.”

For the businesses using diffusion models to power their apps and services, the research might give pause. In a previous interview with TechCrunch, Bradley J. Hulbert, a founding partner at law firm MBHB and an expert in IP law, said he believes it’s unlikely a judge will see the copies of copyrighted works in AI-generated art as fair use — at least in the case of commercial systems like DALL-E 2. Getty Images, motivated by those same concerns, has banned AI-generated artwork from its platform.

The issue will soon play out in the courts. In November, a software developer filed a class action lawsuit against Microsoft, its subsidiary GitHub and business partner OpenAI for allegedly violating copyright law with Copilot, GitHub’s AI-powered, code-generating service. The suit hinges on the fact that Copilot — which was trained on millions of examples of code from the internet — regurgitates sections of licensed code without providing credit.

Beyond the legal ramifications, there’s reason to fear that prompts could reveal, either directly or indirectly, some of the more sensitive data embedded in the image training data sets. As a recent Ars Technica report revealed, private medical records — possibly thousands of them — are among the photos buried in Stable Diffusion’s training set.

The co-authors propose a solution (not pioneered by them) in the form of a technique called differentially private training, which would “desensitize” diffusion models to the data used to train them — preserving the privacy of the original data in the process. Differentially private training usually harms performance, but that might be the price to pay to protect privacy and intellectual property moving forward if other methods fail, the researchers say.
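As a rough illustration of what differentially private training involves (this is a generic DP-SGD sketch, not the co-authors' specific proposal, and every model choice and constant below is illustrative): each example's gradient is clipped to bound its individual influence, then calibrated noise is added before the parameter update, which limits how precisely any single training image can be memorized.

```python
import numpy as np

def dp_sgd_step(weights, X_batch, y_batch, lr=0.1, clip_norm=1.0,
                noise_std=1.0, rng=None):
    """One differentially private SGD step on a toy linear model:
    clip each per-example gradient, then add Gaussian noise to the sum."""
    rng = rng or np.random.default_rng()
    clipped = []
    for x, y in zip(X_batch, y_batch):
        grad = 2 * (x @ weights - y) * x                     # per-example gradient (squared error)
        norm = np.linalg.norm(grad)
        grad = grad * min(1.0, clip_norm / (norm + 1e-12))   # bound each example's influence
        clipped.append(grad)
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0, noise_std * clip_norm, size=weights.shape)        # noise calibrated to the clip norm
    return weights - lr * noisy_sum / len(X_batch)

# Usage on toy data; in practice the clip norm and noise scale determine
# the privacy budget, and the same recipe is applied to neural networks.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(256, 4)), np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w
w = np.zeros(4)
for _ in range(200):
    idx = rng.choice(len(X), size=32, replace=False)
    w = dp_sgd_step(w, X[idx], y[idx], rng=rng)
```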

“Once the model has memorized data, it’s very difficult to verify that a generated image is original,” the researcher said. “I think content creators are becoming aware of this risk.”
