MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

arXiv:2601.21225v2 Announce Type: replace Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we