AI RESEARCH
Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
arXiv cs.AI
arXiv:2603.24999v1 Announce Type: cross The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We