AI RESEARCH

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

arXiv CS.LG

arXiv:2605.02958v1 Announce Type: cross Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns.
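
For readers unfamiliar with the static baseline the abstract contrasts against, the sketch below shows one common way to derive a refusal vector from terminal representations: a difference of mean activations between harmful and harmless prompts. It is not SALO and not the authors' code. The activations are synthetic stand-ins, and the function names, dimensions, and midpoint threshold are all assumptions made for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit difference-of-means direction between two sets of activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def refusal_score(acts, direction):
    """Projection of each activation vector onto the direction."""
    return acts @ direction

# Synthetic stand-ins for final-token hidden states (hypothetical data,
# not real model activations).
rng = np.random.default_rng(0)
d = 16
planted = np.zeros(d)
planted[0] = 1.0  # assumed "refusal" axis planted in the synthetic data
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 3.0 * planted

direction = refusal_direction(harmful, harmless)

# Illustrative midpoint threshold between the two class means.
threshold = 0.5 * (
    refusal_score(harmful, direction).mean()
    + refusal_score(harmless, direction).mean()
)
flagged = refusal_score(harmful, direction) > threshold
```

A detector of this kind reads a single terminal representation, which is the signal the abstract says adversarial suffixes such as GCG can suppress. That limitation is the motivation for looking at upstream activations instead.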