Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models
arXiv CS.AI
arXiv:2603.05773v1 Announce Type: cross Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the \textbf{\underline{D}}isentangled \textbf{\underline{S}}afety \textbf{\underline{H}}ypothesis \textbf{(DSH)}, positing that safety computation operates on two distinct subspaces: a \textit{Recognition Axis} ($\mathbf{}_H$, ``Knowing'') and an \textit{Execution Axis} ($\mathbf{}_R$, ``Acting