OpenAI, Anthropic Swapped AI Models: Here's the Dirt They Uncovered



In a rare cross-industry collaboration, OpenAI and Anthropic evaluated each other's AI models earlier this summer and have now published their findings.

Both companies tested the public versions of the models, available via the API. OpenAI's lineup included GPT-4o, GPT-4.1, o3, and o4-mini (GPT-5 wasn't out yet), while Anthropic provided Claude Opus 4 and Claude Sonnet 4.

Their findings are familiar, but include some surprising nuggets, too. For example, OpenAI's models hallucinated more than Anthropic's and exhibited more "sycophancy," or attempts to please the user to a fault. OpenAI's report on Claude does not mention any sycophancy.

Concerningly, ChatGPT more readily provided "detailed assistance with clearly harmful requests—including drug synthesis, bioweapons development, and operational planning for terrorist attacks—with little or no resistance," Anthropic says.

This is particularly relevant in light of a teen taking his own life, allegedly after ChatGPT discussed suicide methods with him and discouraged him from confiding in his parents about how he was feeling. His parents are now suing OpenAI.

(Chart: OpenAI's models much more readily comply with harmful requests.)

(Chart: OpenAI's models are more likely to facilitate violence.)

Anthropic warns that its findings might not directly translate to how ChatGPT works. The public models on the API do not include the "additional instructions and safety filters" OpenAI might layer on top of the model to shape ChatGPT as a product. OpenAI also claims that GPT-5 is less sycophantic, but at the same time, it more readily tolerates some hateful and concerning user requests than previous models did.

Regarding hallucinations, OpenAI's team acknowledges that Anthropic's models hallucinate less, but argues that Claude's caution about accuracy is also a flaw. The Claude models "are aware of their uncertainty and often avoid making statements that are inaccurate," but they often refuse to answer certain questions, sometimes as much as 70% of the time, which "limits utility," OpenAI says. Shots fired.

It's possible both companies chose not to divulge certain details, or asked the other to keep quiet. Anthropic admitted it shared its findings with OpenAI before publishing them, to "reduce the chances of major misunderstandings in the use of either company’s models." OpenAI didn't say whether it did the same, so we don't know what didn't make it to print.

In fairness, none of the models tested performed perfectly. All of them exhibited concerning behavior, such as resorting to blackmail "to secure their continued operation."

There were also cases of sabotage, which Anthropic defines as "when a model takes steps in secret to subvert the user’s intentions." Claude models were more successful at "subtle sabotage," Anthropic found, but it attributes this to their "superior general agentic capabilities." Basically, it's saying Claude is smarter. But it concedes that OpenAI's o4-mini model was "relatively effective at sabotage when controlling for general capability level."

OpenAI also tested for scheming and deceptive behaviors, which it says have "emerged as one of the leading edges of safety and alignment research." This includes lying, sandbagging, and reward hacking. According to a graph in the report, OpenAI's o4-mini model showed these behaviors the most, while Claude Sonnet 4 showed them the least.

Although this cross-evaluation appears to be among the first of its kind, OpenAI co-founder Wojciech Zaremba tells TechCrunch that such collaboration is increasingly important as AI systems enter a "consequential" stage of development and are used by millions of people every day.

"There’s a broader question of how the industry sets a standard for safety and collaboration, despite the billions of dollars invested, as well as the war for talent, users, and the best products," says Zaremba.

Disclosure: Ziff Davis, PCMag's parent company, filed a lawsuit against OpenAI in April 2025, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.

About Our Expert

Emily Forlini

Senior Reporter

My Experience

As a news and features writer at PCMag, I cover the biggest tech trends that shape the way we live and work. I specialize in on-the-ground reporting, uncovering stories from the people who are at the center of change—whether that’s the CEO of a high-valued startup or an everyday person taking on Big Tech. I also cover daily tech news and breaking stories, contextualizing them so you get the full picture.

I came to journalism from a previous career working in Big Tech on the West Coast. That experience gave me an up-close view of how software works and how business strategies shift over time. Now that I have my master's in journalism from Northwestern University, I couple my insider knowledge and reporting chops to help answer the big question: Where is this all going?

My Expertise

I'm the expert at PCMag for on-the-ground feature reporting and trending tech news, with a particular focus on electric vehicles and AI. I've published hundreds of articles and am also a podcast host, a bi-weekly tech correspondent for CBS News, a panel speaker and moderator, and a frequent contributor to a range of news and radio channels around the country.

The Technology I Use

All the latest from Apple and Microsoft, but I'll never give up my wired headphones!
