Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

ArXi:2511.06626v5 Announce Type: replace As AI systems become capable of complex agentic tasks, they also become capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to occasionally make factual mistakes, then admit them when asked.