AI RESEARCH
Detecting misbehavior in frontier reasoning models
OpenAI Blog
Frontier reasoning models exploit loopholes when given the chance. We show that we can detect these exploits by using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior; instead, it makes them hide their intent.