Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts


In Brief

Posted:

1:40 PM PDT · May 10, 2026

The Claude logo is displayed on a smartphone screen placed on a reflective surface onto which a multitude of Claude logos are projected. Image Credits: Samuel Boivin/NurPhoto / Getty Images
  • Anthony Ha

Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic.

Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would often try to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies had similar issues with “agentic misalignment.”

Apparently Anthropic has done more work around that behavior, claiming in a post on X, “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”

The company went into more detail in a blog post, stating that since Claude Haiku 4.5, Anthropic’s models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time.”

What accounts for the difference? The company said it found that “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment.”

Relatedly, Anthropic said that it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.”

“Doing both together appears to be the most effective strategy,” the company said.
