Selective Classifier-free Guidance for Zero-shot Text-to-speech

arXiv:2509.19668v2 Announce Type: replace-cross In zero-shot text-to-speech, balancing fidelity to the target speaker against adherence to the text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis is underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis.
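
To make the idea of separating CFG conditions concrete, here is a minimal sketch assuming a two-condition decomposition with independent guidance weights for the speaker prompt and the text. This is a common way to split guidance across conditions, not necessarily the formulation used in the paper. The names `separated_cfg`, `toy_model`, `w_speaker`, and `w_text` are hypothetical, and the toy model stands in for a real TTS network.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical model signature: model(x, text, speaker) -> prediction.
# Passing None for a condition means that condition is dropped.
Model = Callable[[Sequence[float], Optional[str], Optional[str]], List[float]]


def separated_cfg(
    model: Model,
    x: Sequence[float],
    text: str,
    speaker: str,
    w_speaker: float,
    w_text: float,
) -> List[float]:
    """Combine three forward passes with one guidance weight per condition.

    Assumed decomposition (illustrative only):
        out = uncond
              + w_speaker * (speaker_only - uncond)
              + w_text    * (full - speaker_only)
    """
    uncond = model(x, None, None)           # no conditions
    speaker_only = model(x, None, speaker)  # speaker prompt only
    full = model(x, text, speaker)          # text and speaker prompt
    return [
        u + w_speaker * (s - u) + w_text * (f - s)
        for u, s, f in zip(uncond, speaker_only, full)
    ]


def toy_model(
    x: Sequence[float], text: Optional[str], speaker: Optional[str]
) -> List[float]:
    """Stand-in for a real network: each active condition adds a fixed shift."""
    shift = (1.0 if text is not None else 0.0) + (10.0 if speaker is not None else 0.0)
    return [v + shift for v in x]


if __name__ == "__main__":
    # Raising w_speaker pushes the output toward the speaker condition,
    # raising w_text pushes it toward the text condition.
    print(separated_cfg(toy_model, [0.0], "hello", "ref_speaker", 2.0, 3.0))
```

With both weights set to 1 the combination reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, so each weight can be tuned independently to trade speaker similarity against text adherence under this assumed decomposition.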