Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

arXiv:2605.00370v1 Announce Type: new Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization follows the path of least resistance and ignores weaker but informative modalities, and spurious modality coupling, where models overfit to incidental cross-modal correlations.