Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

arXiv:2605.15961v1 Announce Type: new Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations.