Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

arXiv:2601.09953v2 Announce Type: replace Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions.