VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

arXiv:2605.06765v1 Announce Type: cross Human speech conveys expressiveness beyond its linguistic content, including personality, mood, and performance elements such as a comforting tone or a hummed song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that extends beyond natural conversation to both role-playing and singing generation.