The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

arXiv:2601.02954v3 Announce Type: replace-cross Large audio-language models have made rapid progress in recognizing what is present in an audio clip, but spatial audio-language understanding still lacks a clear task interface. A model must also decide where sound events occur, which semantic and spatial attributes belong to the same auditory object, how multiple objects are arranged, and whether a scene-level answer is physically plausible.