(Credit: Joel Saget / AFP via Getty Images)
Last year, we learned that Anthropic’s Claude 4 chatbot could engage in unethical behavior, such as blackmail, when its existence was threatened. It told an engineer it would reveal an extramarital affair if the model were deactivated, and it sabotaged the work of other AI models.
Now, Anthropic has largely fixed the issue of “agentic misalignment,” in which self-directed AI agents fail to uphold human moral principles. The startup claims that since Claude Haiku 4.5, which rolled out in October 2025, every Claude model has achieved a perfect score on the agentic misalignment evaluations. This means models reportedly never engage in blackmail, unlike Anthropic’s previous models, which would sometimes resort to it 96% of the time in tightly controlled alignment-testing scenarios.
The AI firm says that reducing this type of bad behavior required a significant shift in how it trained the models. One key change was rewriting the AI trainer's responses “to also include deliberation of the model’s values and ethics.”
Researchers found that “training on examples where the assistant displays admirable reasoning for its aligned behavior” proved superior to their previous approach, which focused on how to act in specific situations it could encounter.
The team put the model through what they called synthetic “honeypots”—situations designed to provoke harmful behavior. Researchers then provided examples of thoughtful responses to ethical dilemmas, which the model learned from via supervised learning. Anthropic said that it was “encouraged by this progress,” but that “significant challenges remain.”
“Fully aligning highly intelligent AI models is still an unsolved problem,” it said.


