RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

ArXi:2603.09809v1 Announce Type: new Audio-Visual Learning (AVL) is one fundamental task of multi-modality learning and embodied intelligence, displaying the vital role in scene understanding and interaction. However, previous researchers mostly focus on exploring downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization