In a rare cross-industry collaboration, OpenAI and Anthropic evaluated each other's AI models earlier this summer and have now published their findings.
They both tested the public version of the models, available via the API. OpenAI included GPT-4o, GPT-4.1, o3, and o4-mini (GPT-5 wasn't out yet), while Anthropic provided Claude Opus 4 and Claude Sonnet 4.
Their findings are familiar, but include some surprising nuggets, too. For example, OpenAI's models hallucinated more than Anthropic's and exhibited more "sycophancy," or attempts to please the user to a fault. OpenAI's report on Claude does not mention any sycophancy.
Concerningly, ChatGPT more readily provided "detailed assistance with clearly harmful requests—including drug synthesis, bioweapons development, and operational planning for terrorist attacks—with little or no resistance," Anthropic says.
This is particularly relevant in light of a teen taking his own life, allegedly after ChatGPT discussed suicide methods with him and discouraged him from confiding in his parents about how he was feeling. His parents are now suing OpenAI.


Anthropic warns that its findings might not directly translate to how ChatGPT works. The public models on the API do not include the "additional instructions and safety filters" OpenAI might layer on top of the model to shape ChatGPT as a product. OpenAI also claims that GPT-5 is less sycophantic, but at the same time, it more readily tolerates some hateful and concerning user requests than previous models.
Regarding hallucinations, OpenAI's team admits Anthropic's models hallucinate less, but claims Claude's sensitivity to accuracy is also a flaw. The Claude models "are aware of their uncertainty and often avoid making statements that are inaccurate," but they often refuse to answer certain questions, sometimes as much as 70% of the time, which "limits utility," OpenAI says. Shots fired.
It's possible both companies chose not to divulge certain details, or asked the other to keep quiet. Anthropic admitted it shared its findings with OpenAI before publishing them, to "reduce the chances of major misunderstandings in the use of either company’s models." OpenAI didn't say if it did the same, so we don't know what didn't make it to print.
In fairness, none of the models tested perfectly. All of them exhibited concerning behavior, such as resorting to blackmail "to secure their continued operation."
There were also cases of sabotage, which Anthropic defines as "when a model takes steps in secret to subvert the user’s intentions." Claude models were more successful at "subtle sabotage," Anthropic found, but it attributes this to Claude's "superior general agentic capabilities." Basically, it's saying Claude is smarter. But it concedes that OpenAI's o4-mini model was "relatively effective at sabotage when controlling for general capability level."
OpenAI also tested for scheming and deceptive behaviors, which it says have "emerged as one of the leading edges of safety and alignment research." This includes lying, sandbagging, and reward hacking. According to the graph below, OpenAI's o4-mini model did this the most, while Claude Sonnet 4 did it the least.

This cross-evaluation appears to be among the first of its kind, and OpenAI co-founder Wojciech Zaremba tells TechCrunch that such work is increasingly important as AI systems enter a "consequential" stage of development and are used by millions of people every day.
"There’s a broader question of how the industry sets a standard for safety and collaboration, despite the billions of dollars invested, as well as the war for talent, users, and the best products," says Zaremba.
Disclosure: Ziff Davis, PCMag's parent company, filed a lawsuit against OpenAI in April 2025, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.


