AI RESEARCH

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

arXiv cs.CL

arXiv:2605.18232v1 Announce Type: new

Somali is a Cushitic language of the Horn of Africa with ~25M speakers, yet no documented dedicated Somali pre