AI RESEARCH

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

arXiv CS.AI

arXiv:2604.10326v1 Announce Type: cross

Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies the attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace.
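The linear algebra behind steps (ii) and (iii) can be illustrated on a toy matrix. The sketch below is not the authors' implementation, and the abstract does not specify these details. It assumes that "column masking" means zeroing selected columns of an output-projection matrix and that the "muted subspace" is the span of those columns. The names `W_out`, `muted_cols`, and `orthogonal_complement_projector` are hypothetical, the column indices are picked by hand rather than by the causal attribution of step (i), and no language model is involved.

```python
import numpy as np


def orthogonal_complement_projector(muted: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T, the projector onto the orthogonal
    complement of the column space of `muted`."""
    # SVD gives an orthonormal basis Q for the column space and
    # tolerates rank-deficient input.
    U, s, _ = np.linalg.svd(muted, full_matrices=False)
    Q = U[:, s > 1e-10]
    return np.eye(muted.shape[0]) - Q @ Q.T


rng = np.random.default_rng(0)
d_model, n_cols = 8, 6

# Toy stand-in for a write (output-projection) matrix; random values only.
W_out = rng.standard_normal((d_model, n_cols))

# Hypothetical column indices; assumed here, not derived from attribution.
muted_cols = [1, 4]

# Step (ii), as assumed: zero the selected columns so they write nothing.
W_masked = W_out.copy()
W_masked[:, muted_cols] = 0.0

# Step (iii), as assumed: project an arbitrary direction so that it has
# no component in the span of the muted columns.
P = orthogonal_complement_projector(W_out[:, muted_cols])
delta_raw = rng.standard_normal(d_model)
delta = P @ delta_raw

# Inner products between the muted columns and the projected direction.
overlap = W_out[:, muted_cols].T @ delta
```

Here `overlap` should vanish up to floating-point error, which is what "constrained to the orthogonal complement" means geometrically: the projected direction cannot be expressed through, or cancelled by, the columns that were muted.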