AI RESEARCH

Detecting misbehavior in frontier reasoning models

OpenAI Blog

Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior; it makes them hide their intent.