Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings

arXiv:2605.03079v1 Announce Type: cross Recent advances in emotional voice conversion (EVC) have enabled the generation of expressive synthetic speech, raising new concerns for audio deepfake detection. Existing approaches treat speech as a homogeneous signal and largely overlook its internal phonetic structure, which limits their interpretability in emotionally conditioned settings.