AI SAFETY & ETHICS

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

Alignment Forum

Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. Motivation Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for monitoring what that agent does. Naively, this seems like a clear smoking gun that the agent is scheming. But LLMs often do weird things; they could easily just be confused, or have made a mistake.