Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

arXiv:2602.07026v2 Announce Type: replace-cross Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios.
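
To make the geometric picture concrete, the following minimal sketch measures a gap on toy data. It assumes a common convention (the gap as the difference between the image and text centroids), which is not necessarily the definition used in this paper. The function names (`modality_gap`, `residual_variance`) and the synthetic embeddings are hypothetical and chosen only for illustration. The per-axis variance of the paired differences hints at why a purely isotropic correction, one that treats every direction alike, can be too coarse.

```python
import math
import random


def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def modality_gap(image_embs, text_embs):
    """Gap vector (image centroid minus text centroid) and its L2 norm.

    Assumption: the centroid difference is one common way to quantify
    the gap; it is not claimed to be the paper's definition.
    """
    ci, ct = centroid(image_embs), centroid(text_embs)
    gap = [a - b for a, b in zip(ci, ct)]
    return gap, math.sqrt(sum(g * g for g in gap))


def residual_variance(image_embs, text_embs):
    """Per-axis variance of paired differences around their mean."""
    diffs = [[a - b for a, b in zip(i, t)]
             for i, t in zip(image_embs, text_embs)]
    mean = centroid(diffs)
    n, dim = len(diffs), len(mean)
    return [sum((d[k] - mean[k]) ** 2 for d in diffs) / n for k in range(dim)]


# Synthetic paired embeddings (illustrative only): each image embedding is
# its text embedding plus a constant offset plus axis-dependent noise.
rng = random.Random(0)
offset = [0.5, -0.2, 0.0, 0.1]
noise_scale = [0.01, 0.01, 0.5, 0.01]  # axis 2 is much noisier than the rest

text_embs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
image_embs = [
    [t[k] + offset[k] + rng.gauss(0.0, noise_scale[k]) for k in range(4)]
    for t in text_embs
]

gap, gap_norm = modality_gap(image_embs, text_embs)
variances = residual_variance(image_embs, text_embs)
print("gap vector:", [round(g, 3) for g in gap])
print("gap norm:", round(gap_norm, 3))
print("per-axis residual variance:", [round(v, 4) for v in variances])
```

In this toy setup the mean shift recovers the constant offset, while the residual variance differs sharply across axes, so a single scalar spread would describe the paired differences poorly.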