CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

arXiv:2604.03231v1 Announce Type: new Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pre