Multi-lingual Functional Evaluation for Large Language Models

arXiv:2506.20793v2 Announce Type: replace Multi-lingual competence in large language models is commonly evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings.