ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

arXiv:2601.03632v2

Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio clip, but they also strongly inherit the speaking style present in that reference. As a result, synthesizing speech in a desired style often requires carefully selecting the reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore ...