ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

arXiv:2509.25843v2

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates, models that refuse harmful requests often comply when the same requests are rephrased in the past tense, revealing a critical generalization gap in current alignment methods whose underlying mechanisms are poorly understood. In this work, we