Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

arXiv:2604.05971v1 Announce Type: cross Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries.
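
To make the notion of "disproportionate focus on the central region" concrete, the sketch below computes the share of a 2D saliency map that falls inside a centered window. This is an illustrative assumption, not the measurement used in the paper: the function name `center_mass_fraction`, the `center_frac` parameter, and the choice of a rectangular central window are all hypothetical, and the saliency map is assumed to be a non-negative per-patch score obtained by some external attribution method.

```python
def center_mass_fraction(saliency, center_frac=0.5):
    """Return the share of total saliency inside a centered window.

    saliency: list of equal-length rows of non-negative numbers
              (hypothetical per-patch importance scores).
    center_frac: side length of the central window as a fraction of
                 each image dimension (an assumed, tunable choice).
    """
    h, w = len(saliency), len(saliency[0])
    # Size of the central window, at least one cell in each dimension.
    ch = max(1, round(h * center_frac))
    cw = max(1, round(w * center_frac))
    # Top-left corner that centers the window in the grid.
    top, left = (h - ch) // 2, (w - cw) // 2

    total = sum(sum(row) for row in saliency)
    if total <= 0:
        raise ValueError("saliency map must have positive total mass")

    inside = sum(sum(row[left:left + cw]) for row in saliency[top:top + ch])
    return inside / total


# A uniform 8x8 map is the unbiased baseline: the central 4x4 window
# covers a quarter of the area, so it holds a quarter of the mass.
uniform = [[1.0] * 8 for _ in range(8)]
baseline = center_mass_fraction(uniform)

# A map with all mass on a corner patch puts nothing in the center.
corner = [[0.0] * 8 for _ in range(8)]
corner[0][0] = 1.0
corner_share = center_mass_fraction(corner)
```

Under these assumptions, a value well above the area baseline (0.25 for a half-width window) would indicate that the scores concentrate in the center, while a value near zero means the mass sits at the boundaries.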