AI RESEARCH

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

arXiv CS.CL

ArXi:2605.05995v1 Announce Type: cross The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to re harmful capabilities while deceptively adhering to safety restrictions.