Toward Automated Robustness Evaluation of Mathematical Reasoning

arXiv:2506.05038v2

Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination.