XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

arXiv:2604.14934v1 Announce Type: new Automatic evaluation metrics are essential for building multilingual translation systems. Common practice is to evaluate these systems by averaging metric scores across languages, yet this is questionable because metrics may suffer from cross-lingual scoring bias, in which translations of equal quality receive different scores in different languages. This problem has not been systematically studied because no benchmark provides parallel-quality instances across languages, and obtaining them through expert annotation is not realistic.