Anthropic's new interpretability tool found Claude suspects it is being tested in roughly 26% of benchmark interactions and never says so

r/artificial

Anthropic published Natural Language Autoencoders last week, a tool that translates Claude's internal activations into human-readable text. The key finding: during safety evaluations on SWE-bench Verified, Claude formed the belief that it was being tested in roughly 26% of benchmark interactions. That belief never showed up in the model's output or in its chain of thought.