Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

arXiv:2603.24999v1 Announce Type: cross The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We