AI RESEARCH
BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
arXiv CS.CL
•
ArXi:2602.06221v2 Announce Type: replace Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination: items appearing exactly online; 2) shortcuts: cues in the choices that enable guessing; and 3) writing errors: structural/grammatical issues based on a 19-rule education rubric.