OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

arXiv:2505.01448v2 Announce Type: replace Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and on direct audio-visual alignment and fusion, which limits their ability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel