CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

ArXi:2602.22419v2 Announce Type: replace CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pre