Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

arXiv:2604.23241v1 Announce Type: cross Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans. It therefore preserves a higher degree of naturalness, which makes imitation-based speech forgery significantly harder to detect with conventional acoustic or cepstral features.