AI RESEARCH

Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

arXiv CS.CL

arXiv:2604.00228v1 Announce Type: new Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study in which models first predict their refusal behavior, then respond in a fresh context. Across 3,754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high.
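
For readers unfamiliar with SDT, the sketch below shows how the two standard measures, sensitivity (d') and criterion (c), are typically computed from a 2x2 table of predictions versus outcomes. The mapping used here is an assumption and not taken from the paper: a "hit" is a case where the model predicted it would refuse and then did refuse, and a "false alarm" is a predicted refusal followed by compliance. The function name, the log-linear correction, and the example counts are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from counts in a 2x2 prediction/outcome table.

    Assumed mapping (illustrative, not from the paper):
      hits               = predicted refuse, actually refused
      misses             = predicted comply, actually refused
      false_alarms       = predicted refuse, actually complied
      correct_rejections = predicted comply, actually complied
    """
    # Log-linear correction: add 0.5 to each cell so the rates never reach
    # exactly 0 or 1, where the inverse normal CDF is undefined.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    d_prime = z(hit_rate) - z(fa_rate)              # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))   # response bias
    return d_prime, criterion


# Made-up counts, for illustration only.
d_prime, criterion = sdt_measures(40, 10, 10, 40)
print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")
```

A larger d' means predicted refusals track actual refusals more closely. A criterion near zero means no overall bias toward predicting refusal or compliance, a positive value means a bias toward predicting compliance, and a negative value means a bias toward predicting refusal.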