A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

arXiv:2604.06327v1 Announce Type: cross Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift: a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We