Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

arXiv:2604.18460v1 Announce Type: new

Multimodal affective computing aims to predict human sentiment, emotion, intention, and opinion from language, acoustic, and visual modalities. However, current models often learn spurious correlations that harm generalization under distribution shift or when modalities are noisy. To address this, we propose a causal modality-invariant representation (CmIR) learning framework for robust multimodal learning. At its core, we