Compressing then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

arXiv:2511.08480v3

Multimodal Large Language Models advance multimodal representation learning by acquiring transferable semantic embeddings, substantially improving performance on vision-language tasks such as cross-modal retrieval, clustering, and classification. An effective embedding should preserve the semantic content of the input comprehensively while emphasizing the features that are discriminative for downstream tasks.