Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment

arXiv:2603.05566v1 Announce Type: new Cross-modal alignment, a crucial task in multimodal learning, aims to achieve semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency as a proxy for semantic consistency, ignoring the non-semantic information present in the embeddings. An intuitive approach is to decouple the embeddings into semantic and modality components and align only the semantic component. However, this