AI RESEARCH
SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark
arXiv cs.CL
arXiv:2605.18232v1 Announce Type: new
Somali is a Cushitic language of the Horn of Africa with ~25M speakers, yet no documented dedicated Somali pre