AI RESEARCH
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
arXiv cs.LG
arXiv:2604.20930v1 (Announce Type: cross)
Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content, with safety failure rates exceeding 95%. Existing input-level defenses fail against ISC in 100% of cases, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model's task-completion drive rather than suppressing it.