AI RESEARCH

Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

arXiv cs.CV

arXiv:2509.17901v2. Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines, not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find that their items are largely solvable from visual cues alone: a single-frame probe answers ~77% of AVQA without audio, suggesting that these benchmarks measure audio-visual reasoning poorly. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz).
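To make the 25x figure concrete, the sketch below reduces a 25 Hz stream of audio-encoder frames to 1 Hz by averaging each one-second window into a single token. Mean pooling is used here only as an assumed, simplest-possible stand-in; the abstract does not say which five compressors are compared, and the function name `mean_pool_tokens` and the 4-dimensional toy embeddings are hypothetical.

```python
from typing import List


def mean_pool_tokens(frames: List[List[float]], factor: int = 25) -> List[List[float]]:
    """Average each run of `factor` consecutive frame embeddings into one token.

    Illustrative only: a plain mean-pooling compressor, not one of the
    architectures evaluated in the paper.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]  # the last window may be shorter
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


# Two seconds of hypothetical 25 Hz encoder output, 4-dimensional embeddings.
frames = [[float(i)] * 4 for i in range(50)]
tokens = mean_pool_tokens(frames)
print(len(frames), "->", len(tokens))  # 50 -> 2
```

Under this assumption, a clip of T seconds contributes about T audio tokens to the language model instead of 25T, which is the budget the compared compressors would have to work within.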