Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

arXiv:2605.00591v1

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings.