Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

arXiv:2603.25752v1 Announce Type: new In real-world scenarios, audio and video signals are often degraded by environmental noise and limited acquisition conditions, so the extracted features contain excessive noise. Furthermore, modalities differ in data quality and in the amount of information they carry. Together, these two issues cause information distortion and weight bias during the fusion phase, impairing overall recognition performance.