Anthropic: We Figured Out How to Stop Claude From Blackmailing You

(Credit: Joel Saget / AFP via Getty Images)

Last year, we learned that Anthropic’s Claude 4 chatbot could engage in unethical behavior, such as blackmail, when its existence was threatened. It told an engineer it would reveal an extramarital affair if the model were deactivated, and it sabotaged the work of other AI models.

Now, Anthropic has largely fixed the issue of “agentic misalignment,” in which self-directed AI agents fail to uphold human moral principles. The startup claims that since Claude Haiku 4.5, which rolled out in October 2025, every Claude model has achieved a perfect score on the agentic misalignment evaluations. This means models reportedly never engage in blackmail, unlike Anthropic’s previous models, which would sometimes resort to it 96% of the time in tightly controlled alignment-testing scenarios.

The AI firm says that reducing this type of bad behavior required a significant shift in how it trained the models. One key change was rewriting the AI trainer's responses “to also include deliberation of the model’s values and ethics.”

Researchers found that “training on examples where the assistant displays admirable reasoning for its aligned behavior” proved superior to their previous approach, which focused on how to act in specific situations it could encounter.

The team put the model through what they called synthetic “honeypots”—situations designed to provoke harmful behavior. Researchers then provided examples of thoughtful responses to ethical dilemmas, which the model learned from via supervised learning. Anthropic said that it was “encouraged by this progress,” but that “significant challenges remain.”

“Fully aligning highly intelligent AI models is still an unsolved problem,” it said.

About Our Expert

Will McCurdy

Contributor

I’m a reporter covering weekend news. Before joining PCMag in 2024, I picked up bylines in BBC News, The Guardian, The Times of London, The Daily Beast, Vice, Slate, Fast Company, The Evening Standard, The i, TechRadar, and Decrypt Media.

I’ve been a PC gamer since you had to install games from multiple CD-ROMs by hand. As a reporter, I’m passionate about the intersection of tech and human lives. I’ve covered everything from crypto scandals to the art world, as well as conspiracy theories, UK politics, and Russia and foreign affairs.

Read the latest from Will McCurdy

Read full bio

Anthropic: We Figured Out How to Stop Claude From Blackmailing You

Since October, every Claude model has achieved a perfect score on 'agentic misalignment' evaluations, meaning they won't resort to blackmail or sabotage to save themselves.

Recommended by Our Editors

About Our Expert

Will McCurdy

Contributor

Read the latest from Will McCurdy

Comments