AI RESEARCH

The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

arXiv cs.CL

arXiv:2603.29244v1

We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, and Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors.