MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

arXiv:2506.04779v3 Announce Type: replace

Speech inherently carries rich acoustic information that extends far beyond its textual content. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch), and phonological characteristics (e.g., prosody, intonation, rhythm), all of which are embedded in speech.