AI RESEARCH
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
arXiv CS.AI
arXiv:2604.12384v1 Announce Type: new Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that cons