English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization

arXiv:2604.00613v1

We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, with 170 hours of English audio, 1.65M English tokens, and 1.40M Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance and leads to nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations.
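To make the idea of orthographic standardization concrete, the following is a minimal sketch of a character-level normalizer for Central Kurdish text written in Arabic script. It is not the standardization procedure proposed in this work, whose rules are not given in the abstract. The function name `standardize` and the mapping table `CHAR_MAP` are hypothetical, and the rules shown (unifying Arabic kaf and yeh variants with their Kurdish counterparts, replacing heh followed by a zero-width non-joiner with the letter ae, dropping tatweel, collapsing whitespace) are assumptions based on commonly observed encoding variation in Kurdish text.

```python
import unicodedata

# Hypothetical mapping, NOT taken from the paper: code points that are
# often used interchangeably in Central Kurdish text, mapped to one form.
CHAR_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0640": "",        # ARABIC TATWEEL (elongation mark) is removed
}
_TRANSLATION = str.maketrans(CHAR_MAP)


def standardize(text: str) -> str:
    """Map variant code points to a single form (illustrative rules only)."""
    # Compose combining sequences so equivalent strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Assumed rule: HEH + ZERO WIDTH NON-JOINER is a common way of typing
    # the vowel ARABIC LETTER AE (U+06D5); unify it to the single letter.
    text = text.replace("\u0647\u200C", "\u06D5")
    # Apply the single-character replacements defined above.
    text = text.translate(_TRANSLATION)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())
```

Under these assumptions, applying `standardize` to both the training references and the system outputs before scoring would keep spelling variants of the same word from being counted as translation errors.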