Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

arXiv:2603.24528v1

Vision-language models (VLMs) like CLIP are trained to align paired text and images. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the