AI RESEARCH

Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

arXiv CS.AI

arXiv:2603.24999v1 Announce Type: cross The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We