AI RESEARCH

Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language

arXiv CS.CL

ArXi:2603.27021v1 Announce Type: new We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60M native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign.