Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

arXiv:2604.26370v1 Announce Type: cross

Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds.