GradShield: Alignment Preserving Finetuning

arXiv:2605.14194v1 Announce Type: new Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we