Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

arXiv:2602.17546v2 Announce Type: replace-cross Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We