S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

arXiv:2503.05085v2 Announce Type: replace Recent advances in large language models (LLMs) have fundamentally reshaped speech-to-speech (S2S) systems, enabling increasingly natural spoken interaction. However, existing benchmarks still rely heavily on text-based evaluation and largely ignore paralinguistic cues such as prosody, emotion, and speaker traits, which are central to expressive and human-like communication. We