Anthropic’s Claude 2.1 release shows the competition isn’t rubbernecking the OpenAI disaster

The OpenAI train wreck may be playing out in slow motion before our eyes, but the company’s competition isn’t sitting around gawking. Anthropic just released Claude 2.1, an improvement on its flagship large language model that keeps it competitive with the GPT series — and now has the useful added feature of “being developed by a company not actively at war with itself.”

This new update to Claude has three major improvements: context window, accuracy and extensibility.

On the context window front, meaning how much data the model can pay attention to at once, Anthropic has leapfrogged OpenAI: The embattled Sam Altman announced a 128,000-token window back at the company's Dev Day (seems so long ago!), and Claude 2.1 can now handle 200,000 tokens. That's enough for "entire codebases, financial statements like S-1s, or even long literary works like The Iliad," the company wrote.
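For a rough sense of what 200,000 tokens buys you, a back-of-envelope sketch helps. This uses the common approximation of about four characters per token for English prose, not Anthropic's actual tokenizer, so treat the numbers as ballpark only:

```python
# Sketch: estimate whether a document fits in a 200K-token window.
# Assumes the rough ~4-characters-per-token heuristic for English text,
# NOT Anthropic's real tokenizer -- results are ballpark only.

CONTEXT_WINDOW = 200_000  # Claude 2.1's advertised token limit


def rough_token_estimate(text: str) -> int:
    """Ballpark token count for English prose (~4 chars per token)."""
    return len(text) // 4


def fits_in_context(text: str, window: int = CONTEXT_WINDOW) -> bool:
    """True if the text plausibly fits in the context window."""
    return rough_token_estimate(text) <= window
```

By this crude measure, an English translation of The Iliad (several hundred thousand characters) lands in the low-to-mid hundreds of thousands of tokens, which is why it shows up in Anthropic's examples as a near-the-limit workload.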

Of course, having more info doesn't necessarily mean the model handles it as well. GPT-4 is still the gold standard on code generation, for instance, and Claude handles requests differently from its competitors, some better, some worse. It's all a work in progress, and ultimately up to users to figure out how best to handle this new capacity.

Accuracy also supposedly gets a boost (this is a notoriously difficult concept to quantify), according to "a large set of complex, factual questions that probe known weaknesses in current models." The results show that Claude 2.1 gives fewer incorrect answers, is less likely to hallucinate, and is better at recognizing when it can't be sure — the model is "significantly more likely to demur rather than provide incorrect information." Again, how useful this is in practice can only be evaluated by users putting it to work.

Lastly, Claude 2.1 can now use tools, just like crows and bonobos. No sharp sticks for the LLM, however: It’s more like the agent functionality we’re seeing emerge in models meant to interact with web interfaces. If the model finds that its best move for a question isn’t to reason it out but to simply use a calculator, or a known API, it will do that instead.

For instance, if it doesn’t know which car or laptop to recommend for someone asking for product advice, it can call out to a model or database better equipped to answer that question, or even perform a web search if that’s appropriate.

These iterative improvements will surely be welcomed by the developers who employ Claude regularly, and show that every day OpenAI loses to power struggles is potentially a day gained by the competition. Anthropic's models may not always stand toe-to-toe with OpenAI's, but this industry moves fast. A few free weeks to catch up might make more difference than anyone expects.