How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

arXiv:2604.04791v1 Announce Type: new

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems that require end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating this end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance at each modeling stage using expert-verified criteria.