Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

arXiv:2604.08147v1 Announce Type: cross

Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment,