AI RESEARCH
AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
arXiv cs.CL
arXiv:2604.08448v1 Announce Type: new AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource.