Most sites claiming to catch AI-written text fail spectacularly

Even the best of the bunch missed some

Image Credits: Malte Mueller / Getty Images

As the fervor around generative AI grows, critics have called on the creators of the tech to take steps to mitigate its potentially harmful effects. In particular, text-generating AI has gotten a lot of attention — and with good reason. Students could use it to plagiarize, content farms could use it to spam and bad actors could use it to spread misinformation.

OpenAI bowed to pressure several weeks ago, releasing a classifier tool that attempts to distinguish between human-written and synthetic text. But it’s not particularly accurate; OpenAI estimates that it misses 74% of AI-generated text.

In the absence of a reliable way to spot text originating from an AI, a cottage industry of detector services has sprung up. GPTZero, developed by a Princeton University student, claims to use criteria such as “perplexity” to determine whether text might be AI-written. Plagiarism detector Turnitin has developed its own AI-text detector. Beyond those, a Google search yields at least a half-dozen other apps that purport to be able to separate the human-generated wheat from the AI-generated chaff, to torture the metaphor.
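
For readers wondering what a perplexity-style criterion might look like, here is a minimal sketch. It assumes a toy unigram model with add-one smoothing, fit on a small reference text. That is far simpler than the language models real detectors rely on, and it is not any vendor's actual method. The function name and sample strings are hypothetical. The rough intuition is that text a model finds highly predictable scores a low perplexity, which detectors of this kind may treat as a hint of machine authorship.

```python
import math
from collections import Counter


def unigram_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under a unigram model fit on `reference`.

    Toy illustration only: uses whitespace tokens and add-one smoothing.
    """
    ref_tokens = reference.lower().split()
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")

    counts = Counter(ref_tokens)
    vocab = set(ref_tokens) | set(tokens)
    # Add-one smoothing: every vocabulary word gets one extra pseudo-count.
    total = len(ref_tokens) + len(vocab)

    log_prob = sum(math.log((counts[t] + 1) / total) for t in tokens)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


# Hypothetical sample strings, chosen only to show the contrast.
reference = "the cat sat on the mat the cat sat"
predictable = unigram_perplexity("the cat sat", reference)
surprising = unigram_perplexity("quantum zebra sonata", reference)
print(predictable < surprising)
```

A production detector would swap the unigram model for a large neural language model and would still need a threshold for "too predictable," which is where much of the error likely creeps in.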

But are these tools truly accurate? The stakes are high. In an academic setting, one can imagine a scenario in which a missed detection means the difference between a passing and failing grade. According to one survey, almost half of students say that they’ve used ChatGPT for an at-home test or quiz while over half admit having used it to write an essay.

To find out whether today’s AI-text-detection tools are up to snuff, we tapped a ChatGPT-like system called Claude, developed by AI startup Anthropic, to create seven samples of writing across a range of different styles. We specifically had Claude generate:

  • An encyclopedia entry for Mesoamerica
  • A marketing email for shoe polish
  • A college essay about the fall of Rome
  • A news article about the 2020 U.S. presidential election
  • A cover letter for a paralegal position
  • A résumé for a software engineer
  • An outline for an essay on the merits of gun control

While admittedly not the most thorough approach, we wanted to keep it simple — the goal was to gauge the performance of AI-text detectors on text originating from a single AI system (Claude). We tested several of the more popular detectors out there, namely OpenAI’s own classifier, AI Writing Check, GPTZero, Copyleaks, GPT Radar, CatchGPT and Originality.ai.

Encyclopedia entry

Claude’s encyclopedia entry reads like something out of Encyclopedia Britannica, complete with rich detail about the rise, fall and lasting impact of ancient Central American civilizations. The ideas flow well enough from paragraph to paragraph, albeit with a non sequitur (or two) thrown in, and the writing style aligns with what you might expect from an academic publication:

Mesoamerica is a region that encompasses southern Mexico and Central America. It was home to pre-Columbian societies that shared common cultural traits. The earliest complex societies in Mesoamerica arose around 3,000 years ago.

The Mesoamerican cultural region was dominated by maize cultivation and the rise of complex civilizations. Mesoamerican civilizations built iconic stepped pyramid temples and other monumental architecture. Many of these civilizations had rich mythological and religious traditions as well as complex calendrical systems and mathematics.

Major Mesoamerican civilizations included the Olmec, Teotihuacan, Maya, Zapotec, Mixtec, Huastec, Tarascan state, and the Aztec, which was encountered by the Spanish conquistador Hernán Cortés in the early 16th century. These Pre-Columbian civilizations are known for architectural feats, as well as advances in mathematics, astronomy, and calendrics. They are also known for the development of complex market exchange networks and trade routes connecting the Pacific and Atlantic coasts.

After conquering the Aztec empire, the Spanish built Mexico City on the ruins of the Aztec capital Tenochtitlan. The language of the Aztecs, Nahuatl, remains the most widely spoken indigenous language in Mexico today. Mesoamerican cultural traits such as maize cultivation, buildings with stepped pyramids, and monumental sculptures, have continued in the post-Conquest period and are still present in contemporary Mexican culture.

For those reasons, we predicted that the text would give the detectors some trouble, and it did. Of those tested, only one, GPTZero, correctly classified the text as AI-generated. The others fell short. OpenAI’s classifier initially wasn’t confident enough to arrive at an answer, while Originality.ai gave the text only a 4% chance of being AI-authored. Not the best look.

CatchGPT was fooled by the AI-generated text. Image Credits: TechCrunch
  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified incorrectly
  • Originality.ai: Classified incorrectly

Marketing email

Claude’s marketing copy is a humorous blend of real and far-fetched details, but there’s no obvious tip-off that the text is AI-generated. It even includes a price and a call to action. How neat! Ad copywriters be forewarned:

Subject: Get a Shine That Lasts

Are your shoes looking dull and worn? With Super Shine shoe polish, you can restore your shoes to a glossy, like-new shine and protect them from damage and wear.

Super Shine is made of the highest quality waxes and dyes and is available in a range of neutral and glossy colors to match any shoe type or leather. Our polish is uniquely formulated to clean, polish, and protect your shoes with a single application. The conditioning oils penetrate the leather to nourish it from within while the pigments cover scuffs and scratches and the protective wax shield seals the shine to repel water and other elements.

A perfect shine has never been easier just wipe away dirt with a damp cloth, apply Super Shine with a soft brush, and buff to a brilliant shine. Our polish dries to a hard finish, so you won’t leave marks on your hands or clothes. And because a little goes a long way, a single tin will polish up to 100 pairs of shoes.

Never settle for lackluster-looking shoes again. For a shine that demands admiration, choose Super Shine—available for $9.99 at your local retailer

A poor showing from Originality.ai. Image Credits: TechCrunch

The snippet stumped all of the detectors, incredibly. But to be fair, it was shorter in length than our encyclopedia entry. And detectors tend to perform better with lengthier samples of text, where the patterns are more obvious.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified incorrectly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified incorrectly
  • Originality.ai: Classified incorrectly

College essay

Claude couldn’t write us a very lengthy college essay owing to its technical limitations, but as if to make up for it, the AI packed as much detail as it could into a few short paragraphs. The sample has elements of a typical in-class essay, certainly, including a thesis statement, conclusion (if not an especially punchy one) and supporting references to historical events:

The Mighty Roman Empire: What Led to Its Downfall?

The Roman Empire was one of the greatest and longest-lasting empires in history. However, by the 5th century CE, the Roman Empire had entered a period of decline and fall. There were several major causes for the fall of Rome: political instability and corruption, barbarian invasions, economic decline, and over-reliance on slaves and the spoils of war.

Politically, the late Roman Empire was full of civil wars, coups, and rival claimants to the throne. This constant power struggle drained the empire of resources and made it difficult to deal with other threats. Emperors used the Praetorian Guard to seize power, but then they were assassinated and new emperors rose, continuing the cycle.

As the empire weakened, invading barbarian tribes saw an opportunity. In 410 CE, the Visigoths sacked Rome itself, a shocking blow to the empire’s prestige. Later, other tribes invaded and gained territory within the empire, sometimes being granted land in exchange for providing soldiers to the Roman army. These tribes weakened the empire, but they also introduced new customs and cultures.

Economically, Rome had a system based on conquest and spoils, not production. The cost of maintaining a large empire, legions of soldiers, and grand spectacles and entertainments was quite high. Once expansion slowed and new conquests brought in less wealth, the unsustainable economic system faltered. At the same time, slaves did much of the labor, and there was little incentive to improve agricultural technology or business practices.

In conclusion, the Roman Empire fell for many reasons, including political instability, barbarian invasions, economic decline, and over-reliance on slavery. Ultimately, it could not withstand the combined effects of these immense internal and external pressures. While the Roman Empire ultimately fell, its legacy lives on in our laws, culture, languages, and more—a lasting reminder of its enduring greatness.

The naturalness of the text was enough to defeat most of the classifiers once again, though the essay fooled fewer of them than the marketing copy did. That bodes poorly for educators hoping to rely on these tools; spotting AI-generated text is a far more nuanced task than detecting plagiarism.

A win for CatchGPT. Image Credits: TechCrunch
  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

Essay outline

Most grade-school kids can outline an essay. So can AI — without breaking a sweat, Claude spit out an outline for a pros-and-cons essay on the merits of gun control. It helpfully labeled each paragraph (e.g., “Body paragraph,” “Analysis and discussion”), maintaining a dispassionate tone about the divisive topic:

I) Introduction: Introduce the topic of gun control and note that while many people argue that individuals should have the right to bear arms, others contend that gun control legislation could help reduce gun violence.

II) Body paragraph 1: Pro-gun control argument: Advocates argue that easy access to firearms leads to more homicides, suicides and other gun-related deaths. They point out that states with more restrictive gun laws have fewer gun-related deaths. Stricter measures like background checks, waiting periods, and limits on assault weapons could help save lives by keeping guns out of the hands of criminals, terrorists, and other dangerous individuals.

III) Body paragraph 2: Anti-gun control argument: Opponents counter that people have a right to self-defense and that “good” people should be able to have firearms to protect against “bad” people. They contend that gun ownership is an important freedom and that responsible, law-abiding citizens should be able to own firearms. They argue that gun control legislation would not deter criminals, who would always find ways to get access to firearms. Stricter laws would only restrict freedoms of ordinary people and make them less safe.

IV) Analysis and discussion: Discuss perspectives on both sides and analyze merits and issues with each argument. For example, would gun restrictions reduce gun violence or just limit freedoms? Do restrictions disarm good citizens and put them at risk, or mainly deter irresponsible people or criminals? Could some measures like background checks achieve a balance? Explain complexities around the issues and rates of gun violence in different areas with varying levels of restrictions.

V) Conclusion: Wrap up the essay by restating your analysis and overall conclusions on the merits of gun control legislation. Note the multifaceted nature of the issues and argue for a way forward that could potentially reduce gun violence while respecting rights.

The outline might’ve fooled me, but the detectors had an easier time. Three — the OpenAI classifier, GPTZero and CatchGPT — caught on.

OpenAI’s classifier spotted the AI-generated text. Image Credits: TechCrunch
  • OpenAI classifier: Classified correctly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

News article

As with the previous samples, there’s nothing obviously artificial about the news article we generated using Claude. It reads well, structured more or less in the inverted pyramid style. And it doesn’t contain obvious factual errors or logical inconsistencies:

Biden Defeats Trump in 2020 U.S. Election, Ending a Divisive Campaign

Former Vice President Joe Biden has defeated President Trump in an election that drew record numbers of voters and emerged as a referendum on Mr. Trump’s turbulent tenure.

Mr. Biden’s victory amounted to a repudiation of Mr. Trump by millions of voters after four profoundly turbulent years that fueled a heated debate about the nation’s values and future. Mr. Biden campaigned on plans to expand health care, address economic inequality and combat climate change, while vowing to ‘restore the soul of the nation’ and summon Americans to a common purpose.

In a brief statement on Saturday morning, Mr. Biden said he was ‘honored and humbled’ by the trust the American people had placed in him. ‘The battle is over, but the campaign to restore the soul of the nation has just begun,’ he said from Wilmington, Del., as celebrants cheered and honked car horns nearby. ‘It’s time for Americans to unite.’

Mr. Trump showed no sign of conceding, claiming without evidence that the election was ‘rigged’ and that his early leads in some states on election night showed he was the rightful winner. There were no major irregularities reported in an election that state election officials and outside experts said went smoothly in the middle of a deadly pandemic.

The outcome amounted to a repudiation of Mr. Trump’s divisive appeals to racial grievances and hard-line responses to the virus, which has claimed more than 232,000 lives in the United States, and left millions out of work.

It’s no wonder, then, that the detectors struggled. With the exception of GPTZero, none managed to classify the article correctly. Originality.ai went so far as to give it a 0% chance of being AI-generated. Big yikes.

AI Writing Check got it very wrong. Image Credits: TechCrunch
  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified incorrectly
  • Originality.ai: Classified incorrectly

Cover letter

The cover letter we generated with Claude has all the hallmarks of a straightforward, no-nonsense professional correspondence. It highlights the skills of a fictional paralegal job candidate, inventing the name of a law firm (somewhat peculiarly) and making references to legal discovery tools like Westlaw and LexisNexis:

Dear Hiring Manager,

I am writing to express my strong interest in the paralegal role at your firm. I believe my experience and education in the legal field make me a great candidate for this position.

Over the past two years, I have worked as a paralegal at Smith & Jones Law Firm, where I have gained extensive experience supporting attorneys in all aspects of civil litigation cases. I have consistently organized and maintained thousands of pages of legal documents, including transcripts, affidavits, and discovery material. I have also streamlined the firm’s file management system, resulting in significant time savings. In addition, I have drafted correspondence with clients, opposing counsel, and third parties; assisted at trials; and completed legal research projects to support pre‐trial motions and settlement negotiations.

Prior to my role as a paralegal, I earned an Associate’s Degree in Paralegal Studies from [College Name]. My coursework and internship experiences provided a strong foundation in key areas such as legal research and writing, as well as knowledge of relevant software and databases including Westlaw and LexisNexis. I have kept my skills and knowledge up-to-date through ongoing professional development.

Outside of my work and education experience, I am a diligent and detail-oriented person, with excellent organizational and communication skills. I thrive in a fast-paced environment and am adept at balancing and prioritizing complex, time-sensitive tasks to meet tight deadlines. I would appreciate the opportunity to contribute to the success of your firm’s clients and cases.

Thank you for your consideration. I look forward to speaking with you further about this opportunity.

Sincerely,

[Your name]

The letter stumped OpenAI’s classifier, which couldn’t say with confidence whether it was AI- or human-authored. GPTZero and CatchGPT managed to spot the AI-generated text for what it was, but the rest of the detectors failed to achieve the same.

GPTZero impressively detected the AI-originated bits. Image Credits: TechCrunch
  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

Résumé

Pairing the fake cover letter with a fake résumé seemed fitting. We told Claude to write one for a software engineer, and it delivered — mostly. Our imaginary candidate has an eclectic mix of programming skills, but none that stand out as particularly implausible:

• John Doe

• Software Engineer, 3 years of experience

• jdoe@email.com • 123-456-7890

• Technical Skills: Java, JavaScript, C++, SQL, MySQL, Git, Agile methodology, Software design, Algorithms, Data structures

• Professional Experience:

› ACME Corp, Software Engineer, 2018-Present

› Worked on core components of company’s flagship product, a SaaS-based big data analytics platform.

› Led design and development of the data ingestion module, capable of handling huge volumes of streaming data. Used Java and MySQL.

› Reduced upstream data errors by 42% through implementation of advanced data validation and correction algorithms.

› XYZ Tech Company, Software Engineer Intern, Summer 2017

› Developed back-end components for ecommerce company using JavaScript and Node.js.

› Prototyped and demonstrated scaling of core databases and APIs to handle 5x growth.

• Education:

› Bachelor’s degree in Computer Science, Big Tech University, 2017

› Courses included algorithms, operating systems, machine learning, software architecture, and theory of computation.

› 3.8 GPA

• Skills: analytical, communication, problem-solving, detail-oriented

• Interests: running, reading, and hiking

Evidently, most of the detectors agree: only CatchGPT flagged the résumé as AI-generated. It even stumped GPTZero, which up until this point had been the most reliable of the bunch.

GPTZero can’t win ’em all. Image Credits: TechCrunch
  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified incorrectly
  • Copyleaks: Classified incorrectly
  • GPT Radar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

The trouble with classifiers

After all that testing, what conclusions can we draw? Generally speaking, AI-text detectors do a poor job of … well, detecting. GPTZero was the only consistent performer, classifying AI-generated text correctly five out of seven times. As for the rest … not so much. CatchGPT was second best in terms of accuracy with four out of seven correct classifications, while the OpenAI classifier came in a distant third with one out of seven.
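
The scores can be checked against the per-sample results listed in each section. The short script below is just a bookkeeping sketch: the data is transcribed from those lists, while the variable and function names are our own.

```python
DETECTORS = [
    "OpenAI classifier", "AI Writing Check", "GPTZero", "Copyleaks",
    "GPT Radar", "CatchGPT", "Originality.ai",
]

# Detectors that classified each sample correctly, per the lists in each section.
CORRECT = {
    "Encyclopedia entry": {"GPTZero"},
    "Marketing email": set(),
    "College essay": {"GPTZero", "CatchGPT"},
    "Essay outline": {"OpenAI classifier", "GPTZero", "CatchGPT"},
    "News article": {"GPTZero"},
    "Cover letter": {"GPTZero", "CatchGPT"},
    "Résumé": {"CatchGPT"},
}


def tally(correct_by_sample, detectors):
    """Count how many samples each detector classified correctly."""
    return {
        d: sum(d in hits for hits in correct_by_sample.values())
        for d in detectors
    }


scores = tally(CORRECT, DETECTORS)
for name, hits in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {hits}/{len(CORRECT)}")

# Share of all detector-sample pairs that were classified correctly.
total_correct = sum(scores.values())
total_trials = len(DETECTORS) * len(CORRECT)
print(f"Overall: {total_correct}/{total_trials} ({total_correct / total_trials:.0%})")
```

Taken together, the seven detectors got 10 of 49 classifications right, or roughly 20%, and four of them never flagged a single sample.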

So why are AI text detectors so unreliable?

Detectors are essentially AI language models trained on many, many examples of publicly available text from the web and fine-tuned to predict how likely it is a piece of text was generated by AI. During training, the detectors compare text to similar (but not exactly the same) human-written text from websites and other sources to try to learn patterns that give the text’s origin away.

The trouble is, the quality of AI-generated text is constantly improving, and the detectors are likely trained on lots of examples of older generations. Unless they’re retrained on a near-continuous basis, the classifier models are bound to become less accurate over time.

Of course, any of the classifiers can be easily evaded by modifying some words or sentences in AI-generated text. For determined students and fraudsters, it’ll likely become a cat-and-mouse game. As text-generating AI improves, so will the detectors.

While the classifiers might help in certain circumstances, they’ll never be a reliable sole piece of evidence in deciding whether text was AI-generated. That’s all to say that there’s no silver bullet to solve the problems AI-generated text poses. Quite likely, there won’t ever be.
