ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

arXiv:2507.09318v2 Announce Type: replace-cross Generating spoken dialogue is inherently more complex than monologue text-to-speech (TTS), as it demands both realistic turn-taking and the maintenance of distinct speaker timbres. While existing autoregressive (AR) models have made progress, they often suffer from high inference latency and stability issues. To overcome these limitations, we propose ZipVoice-Dialog, a non-autoregressive (NAR) zero-shot spoken dialogue generation model based on flow matching.