One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

arXiv:2605.14605v1 Announce Type: cross Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on harmful data. Recent defenses aim to make models robust to such malicious fine-tuning, but they are largely evaluated only against fixed attacks that do not account for the defense. We show that these robustness claims are incomplete.