Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

arXiv:2604.11576v1 Announce Type: new Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance CLIP's adversarial robustness, recent studies finetune its pretrained vision encoder on adversarial examples from a proxy dataset such as ImageNet, aligning adversarial images with their correct class labels. However, these methods overlook the important roles of