Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings

ArXi:2604.23130v1 Announce Type: cross Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity.