AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

arXiv:2510.23896v2 Announce Type: replace Text embeddings are an essential building block for many NLP tasks, such as retrieval-augmented generation, which is crucial for mitigating hallucinations in LLMs. Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks, such as FLORES clustering or SIB-200. In this paper, we