CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
arXiv cs.CV
arXiv:2604.03231v1 Announce Type: new Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pre