AI SAFETY & ETHICS

OpenAI: How we monitor internal coding agents for misalignment

LessWrong AI

Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment. OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context, that is everything the agent saw, and everything the agent did, including tool calls and CoT. Higher severity cases are sent for human review within 30 minutes.