AI RESEARCH
Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
arXiv CS.AI
arXiv:2603.16406v1 Announce Type: cross. This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly for low- and medium-resource languages. We show that benchmarks that include synthetic or machine-translated data that has not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity.