The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

arXiv:2605.11205v1 Announce Type: cross Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty.
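
The interaction of the two conditions can be illustrated with a minimal sketch. It is not the paper's method or data: the abilities, difficulties, and observation mask below are hypothetical values chosen for illustration, and the one-parameter (Rasch) model is assumed here as the simplest form of Item Response Theory. To keep the sketch deterministic, each observed cell holds the expected score under that model rather than a sampled 0/1 outcome.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical ground truth (illustrative values, not taken from the paper).
true_ability = np.array([1.0, 0.0])  # model 0 is stronger than model 1
true_difficulty = np.array([-2.0, -1.0, 0.0, 0.0, 1.0, 2.0])

# Sparse evaluation matrix: True where a (model, item) pair was evaluated.
observed = np.array([
    [False, False, True, True, True, True],  # strong model: mostly hard items
    [True, True, True, True, False, False],  # weak model: mostly easy items
])

# Expected score per cell under the assumed Rasch model:
# P(correct) = sigmoid(ability - difficulty).
scores = sigmoid(true_ability[:, None] - true_difficulty[None, :])

def simple_average(scores, observed):
    """Mean score per model over the items it was actually evaluated on."""
    return (scores * observed).sum(axis=1) / observed.sum(axis=1)

def fit_rasch(scores, observed, lr=0.5, steps=20000):
    """Joint maximum-likelihood fit of abilities and difficulties
    by plain gradient ascent on the observed cells only."""
    ability = np.zeros(scores.shape[0])
    difficulty = np.zeros(scores.shape[1])
    for _ in range(steps):
        pred = sigmoid(ability[:, None] - difficulty[None, :])
        resid = (scores - pred) * observed  # per-cell log-likelihood gradient
        ability += lr * resid.sum(axis=1)
        difficulty -= lr * resid.sum(axis=0)
    return ability, difficulty

avg = simple_average(scores, observed)
ability, difficulty = fit_rasch(scores, observed)

print("simple average:", avg.round(3))
print("Rasch ability: ", ability.round(3))
```

Under these assumed values, the simple averages come out at roughly 0.56 for the stronger model and 0.65 for the weaker one, so averaging inverts the ranking. The fitted abilities restore the order because the two shared items link both models to a common difficulty scale. Abilities are identified only up to a constant shift, so the gap between them is the meaningful quantity, and the fit fails if no items or models connect the observed cells.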