AI RESEARCH

Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

arXiv CS.AI

ArXi:2604.08970v1 Announce Type: cross We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We