AI RESEARCH

NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

arXiv CS.CL

arXiv:2605.07051v1 Announce Type: new Large Language Models (LLMs) have shown good performance on various science education benchmarks, demonstrating their potential for use in science and mathematics education. Yet LLMs tend to be evaluated on science and mathematics education datasets from the Western world, with datasets from the Global South underrepresented. Furthermore, these datasets tend to have multiple-choice answer options that are trivial to evaluate.