While anticipation builds for GPT-4, OpenAI quietly releases GPT-3.5

Released two years ago, OpenAI’s remarkably capable, if flawed, GPT-3 was perhaps the first model to demonstrate that AI can write convincingly, if not perfectly, like a human. The successor to GPT-3, most likely called GPT-4, is expected to be unveiled in the near future, perhaps as soon as 2023. But in the meantime, OpenAI has quietly rolled out a series of AI models based on “GPT-3.5,” a previously unannounced, improved version of GPT-3.

GPT-3.5 broke cover on Wednesday with ChatGPT, a fine-tuned version of GPT-3.5 that’s essentially a general-purpose chatbot. Debuted in a public demo yesterday afternoon, ChatGPT can engage with a range of topics, including programming, TV scripts and scientific concepts.

According to OpenAI, GPT-3.5 was trained on a blend of text and code published prior to Q4 2021. Like GPT-3 and other text-generating AI, GPT-3.5 learned the relationships between sentences, words and parts of words by ingesting huge amounts of content from the web, including hundreds of thousands of Wikipedia entries, social media posts and news articles.

Rather than release the fully trained GPT-3.5, OpenAI used it to create several systems fine-tuned for specific tasks — each available through the OpenAI API. One — text-davinci-003 — can handle more complex instructions than models built on GPT-3, according to the lab, and is measurably better at both long-form and “high-quality” writing.
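As a rough illustration of what "available through the OpenAI API" means in practice, the sketch below prepares (but does not send) one request per model. The endpoint URL, the `model`, `prompt`, `max_tokens` and `temperature` fields, and the `OPENAI_API_KEY` environment variable follow OpenAI's publicly documented Completions API as best recalled; none of them come from this article, so treat them as assumptions. `build_completion_request` is a hypothetical helper.

```python
import json
import os
import urllib.request

# Assumed endpoint for OpenAI's text-completion API (not stated in the article).
COMPLETIONS_URL = "https://api.openai.com/v1/completions"


def build_completion_request(
    model: str, prompt: str, max_tokens: int = 256
) -> urllib.request.Request:
    """Hypothetical helper: prepare, but do not send, a completion request."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    headers = {
        "Content-Type": "application/json",
        # The key is read from the environment; a placeholder is used if unset.
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'YOUR_API_KEY')}",
    }
    return urllib.request.Request(
        COMPLETIONS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# The same prompt is prepared for the GPT-3-based and GPT-3.5-based models.
prompt = "What is the philosophy behind WeWork?"
requests_by_model = {
    model: build_completion_request(model, prompt)
    for model in ("text-davinci-002", "text-davinci-003")
}

for model, request in requests_by_model.items():
    payload = json.loads(request.data)
    print(model, "->", request.full_url, payload["max_tokens"])
```

Passing one of these objects to `urllib.request.urlopen` would actually send it. That step is left out so the sketch runs without a network connection or an API key.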

According to OpenAI data scientist Jan Leike, text-davinci-003 is similar but not identical to InstructGPT, a family of GPT-3-based models released by OpenAI earlier this year that are less likely to generate problematic (e.g., toxic and highly biased) text while more closely aligning with a user’s intent. Text-davinci-003 — and by extension GPT-3.5 — “scores higher on human preference ratings” while suffering from “less severe” limitations, Leike said in a tweet.

That appears to be the case anecdotally. Data scientists at Pepper Content, a content marketing platform, report that text-davinci-003 “performs better in understanding the ‘context’ behind a request and then using that to produce better content” while “hallucinating” less than GPT-3-based models. (Where it concerns text-generating AI, hallucination refers to an AI writing inconsistent, factually incorrect statements.)

In a test on OpenAI’s Playground website, which provides a UI frontend for the models, the Pepper Content team fed several prompts to text-davinci-003 and a model based on GPT-3 (text-davinci-002). Given “What is the philosophy behind WeWork?,” the GPT-3.5-based text-davinci-003 generated this:

WeWork is a global workspace provider that believes people are the most important asset in any organization. The philosophy of WeWork is to create a collaborative environment that enables people to work together in a flexible and efficient way. The company encourages collaboration and productivity, while providing a comfortable and inspiring space.

WeWork is also committed to being a socially responsible organization, by finding ways to reduce its environmental impact, by providing meaningful work experiences, and by promoting diversity and inclusion. WeWork also strives to create meaningful experiences for its members, through its unique community-based programming, events and activities. The company believes that when people work together in an inspiring and collaborative environment, they can achieve more and create meaningful change.

It’s not perfect — note the excess commas and repetitiveness. But the copy’s certainly more engaging than what the GPT-3-based text-davinci-002 produced:

At WeWork, we believe in openness, collaboration, and community. We have reimagined what a workspace can be by bringing together a global community of creators, entrepreneurs, and startups — anyone looking to build something meaningful and transform the world. We are proud to be your neighbour, coworker, and friend.

GPT-3.5 is also better at generating blog posts, it seems. Here’s what the Pepper Content team got when they prompted text-davinci-003 to write a post about picking a sofa:

If you are looking to give your green living room a fresh look, adding the best sofa for the space is a great way to start. However, there is more to choosing the perfect sofa than just finding a piece that looks good. You want to find a piece that is comfortable and fits your green living room décor.

In this blog, we will discuss how to pick sofa for your green living room. We will look at the different types of sofas available, the design factors to consider when making your selection, and how to choose a sofa that fits your style and budget. We will also provide tips on how to maintain and care for your new sofa. By the end of this blog, you will have all the information you need to pick the perfect sofa for your green living room.

Again, it isn’t foolproof. GPT-3.5 oddly added the bit about a “green living room.” But, again, GPT-3’s output is more basic and less grammatically correct:

Sofa is one of the most basic requirements in a living room. It’s not just a piece of furniture but an important part of the décor of your living room. So, what should be the criteria while picking a sofa? If you are wondering about this then stay with me as I discuss the different aspects of the sofa would help you in picking the best one for yourself.

Experiments beyond Pepper Content’s suggest that GPT-3.5 tends to be much more sophisticated and thorough in its responses than GPT-3. For example, when YouTube channel All About AI prompted text-davinci-003 to write a history of AI, the model’s output mentioned key luminaries in the field, including Alan Turing and Arthur Samuel, while text-davinci-002’s did not. All About AI also found that text-davinci-003 tended to have a more nuanced understanding of instructions, for instance providing details such as a title, description, outline, introduction and recap when asked to create a video script.

That’s no accident — a hallmark feature of text-davinci-003/GPT-3.5’s outputs is verboseness. (This writer can sympathize.) In an analysis, scientists at startup Scale AI found text-davinci-003/GPT-3.5 generates outputs roughly 65% longer than text-davinci-002/GPT-3 with identical prompts.

Perhaps less useful for most potential users but nonetheless entertaining, text-davinci-003/GPT-3.5 is better than its predecessor at composing songs, limericks and rhyming poetry. Ars Technica reports that commenters on Y Combinator’s Hacker News forum used text-davinci-003 to write a poem explaining Albert Einstein’s theory of relativity and then rewrite the poem in the style of John Keats. See:

If you want to understand Einstein’s thought
It’s not that hard if you give it a shot
General Relativity is the name of the game
Where space and time cannot remain the same
Mass affects the curvature of space
Which affects the flow of time’s race
An object’s motion will be affected
By the distortion that is detected
The closer you are to a large mass
The slower time will seem to pass
The farther away you may be
Time will speed up for you to see

The Scale AI team even found that text-davinci-003/GPT-3.5 has a notion of poetic meters such as iambic pentameter. See:

O gentle steeds, that bear me swift and sure
Through fields of green and pathways so obscure,
My heart doth swell with pride to be with you
As on we ride the world a-fresh to view
The wind doth whistle through our hair so free
And stirs a passion deep inside of me.
My soul doth lift, my spirits soar on high,
To ride with you, my truest friend, am I
Your strength and grace, your courage and your fire,
Inspire us both to go beyond our sire.
No earthly bonds can hold us, only fate,
To gallop on, our wond’rous course create

Relatedly, GPT-3.5 is wittier than GPT-3 — at least from a subjective standpoint. Asking text-davinci-002/GPT-3 to “tell a joke” usually yields this:

Why did the chicken cross the road? To get to the other side.

Text-davinci-003/GPT-3.5 has cleverer responses:

Q: What did the fish say when it hit the wall? A: Dam!

Q: What did one ocean say to the other ocean? A: Nothing, they just waved.

Scale AI had the model explain Python code in the style of Eminem, a feat which text-davinci-002/GPT-3 simply couldn’t accomplish:

Yo, so I’m loopin’ through this list
With each item that I find
I’m gonna print out every letter in each one of them
Dog, Cat, Banana, Apple, I’m gonna get’em all with this rhyme

So why is GPT-3.5 better than GPT-3 in these particular areas? We can’t know the exact answer without additional details from OpenAI, which aren’t forthcoming; an OpenAI spokesperson declined a request for comment. But it’s safe to assume that GPT-3.5’s training approach had something to do with it. Like InstructGPT, GPT-3.5 was trained with the help of human trainers who ranked and rated the way early versions of the model responded to prompts. This information was then fed back into the system, which tuned its answers to match the trainers’ preferences.
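The ranking step described above can be sketched in a few lines. This is a minimal illustration that assumes the pairwise ranking loss commonly used to train a reward model from human comparisons. OpenAI has not confirmed the details for GPT-3.5, and the response names, scores and function names here are hypothetical.

```python
import math
from itertools import combinations


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected))."""
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_to_pairs(ranked_best_first):
    """Expand a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_best_first, 2))


# Hypothetical data: a trainer ranked three responses, best first.
trainer_ranking = ["response_a", "response_b", "response_c"]

# Hypothetical scores from a reward model that is still being trained.
reward_scores = {"response_a": 1.2, "response_b": 0.3, "response_c": 0.9}

pairs = ranking_to_pairs(trainer_ranking)
losses = [
    pairwise_preference_loss(reward_scores[preferred], reward_scores[rejected])
    for preferred, rejected in pairs
]
mean_loss = sum(losses) / len(losses)
print(f"{len(pairs)} pairs, mean loss {mean_loss:.3f}")
```

The loss shrinks when the reward model scores the trainer-preferred response higher and grows when it disagrees with the trainer, as it does here for `response_b` versus `response_c`. A reward model trained this way can then be used to steer the language model toward responses that trainers tend to prefer.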

Of course, this doesn’t make GPT-3.5 immune to the pitfalls to which all modern language models succumb. Because GPT-3.5 merely relies on statistical regularities in its training data rather than a human-like understanding of the world, it’s still prone to, in Leike’s words, “mak[ing] stuff up a bunch.” It also has limited knowledge of the world after 2021 because its training data is more sparse after that year. And the model’s safeguards against toxic output can be circumvented.

Still, GPT-3.5 and its derivative models demonstrate that GPT-4 — whenever it arrives — won’t necessarily need a huge number of parameters to best the most capable text-generating systems today. (Parameters are the parts of the model learned from historical training data and essentially define the skill of the model on a problem.) While some have predicted that GPT-4 will contain over 100 trillion parameters — nearly 600 times as many as GPT-3 — others argue that emerging techniques in language processing, like those seen in GPT-3.5 and InstructGPT, will make such a jump unnecessary.
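The "nearly 600 times" comparison can be checked with quick arithmetic. The sketch assumes GPT-3's published size of 175 billion parameters, a figure that does not appear in this article.

```python
# GPT-3's published parameter count (assumption: not stated in the article).
GPT3_PARAMETERS = 175e9

# The "100 trillion parameters" prediction cited above.
PREDICTED_GPT4_PARAMETERS = 100e12

ratio = PREDICTED_GPT4_PARAMETERS / GPT3_PARAMETERS
print(f"Predicted GPT-4 size is about {ratio:.0f} times GPT-3's")
```

The result is roughly 571, which the article rounds up to "nearly 600."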

One of those techniques could involve browsing the web for greater context, a la Meta’s ill-fated BlenderBot 3.0 chatbot. John Schulman, a research scientist and co-founder of OpenAI, told MIT Tech Review in a recent interview that OpenAI is continuing work on a language model it announced late last year, WebGPT, that can go and look up information on the web (via Bing) and give sources for its answers. At least one Twitter user appears to have found evidence of the feature undergoing testing for ChatGPT.

OpenAI has another reason to pursue lower-parameter models as it continues to evolve GPT-3: huge costs. A 2020 study from AI21 Labs pegged the expenses for developing a text-generating model with only 1.5 billion parameters at as much as $1.6 million. OpenAI has raised over $1 billion to date from Microsoft and other backers, and it’s reportedly in talks to raise more. But all investors, no matter how big, expect to see returns eventually.