Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

arXiv:2412.00727v3 Announce Type: replace Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, makes them both a worthwhile and relatively easy target for backdoor attacks. As