Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

arXiv:2601.21463v2 Announce Type: replace-cross

Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or a limited set of editing operations, which restricts their diversity and leaves realistic editing scenarios poorly covered. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies. This fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal.
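
To illustrate why frame-level supervision struggles with deletion-type edits, here is a minimal sketch. It is not the paper's method: the helper `frame_labels`, the frame counts, and the span convention (half-open `[start, end)` frame indices in the edited signal) are all assumptions made for illustration. A substitution leaves edited frames that can be labeled, while a deletion leaves a zero-length span and therefore no positive frame labels at all.

```python
from typing import List, Tuple


def frame_labels(num_frames: int, edited_spans: List[Tuple[int, int]]) -> List[int]:
    """Hypothetical helper: per-frame labels for an edited clip.

    A frame is labeled 1 if it lies inside an edited span [start, end)
    measured in the edited signal, and 0 otherwise.
    """
    labels = [0] * num_frames
    for start, end in edited_spans:
        for i in range(max(start, 0), min(end, num_frames)):
            labels[i] = 1
    return labels


# Substitution (assumed example): frames 4..6 of a 10-frame clip were replaced,
# so three frames in the edited signal carry a positive label.
substitution = frame_labels(10, [(4, 7)])

# Deletion (assumed example): three frames were removed at position 4, leaving
# 7 frames. The removed content occupies a zero-length span in the edited
# signal, so no frame receives a positive label.
deletion = frame_labels(7, [(4, 4)])

print(substitution)
print(deletion)
```

Under these assumptions the deletion case gives a frame-level detector no positive target to learn from, even though the clip has been edited.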