AI RESEARCH
Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
arXiv cs.CL
arXiv:2604.12359v1 Announce Type: cross
Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient way to inject such backdoors by directly modifying model weights to map a trigger to an attacker-specified response.