ATIR: Towards Audio-Text Interleaved Contextual Retrieval

arXiv:2604.20267v1 Announce Type: cross Audio carries richer information than text, including emotion, speaker traits, and environmental context, and it can be processed with lower latency than speech-to-text pipelines allow. However, recent multimodal information retrieval research has focused predominantly on images and largely overlooked audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we