Efficient Evaluation of LLM Performance with Statistical Guarantees

arXiv:2601.20251v3 Announce Type: replace-cross Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage.
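
To make the finite-population framing concrete, the following is a minimal baseline sketch and not the method proposed in the paper. It assumes a benchmark of `population_size` items, a hypothetical callable `is_correct(i)` that queries the model on item `i` and returns whether the answer was right, and simple random sampling without replacement. The interval is the textbook Wald interval with a finite-population correction, so its coverage is only approximate and can be poor for small budgets or accuracies near 0 or 1.

```python
import math
import random


def srs_accuracy_ci(is_correct, population_size, budget, z=1.96, seed=0):
    """Estimate accuracy from `budget` queries drawn without replacement.

    is_correct      -- hypothetical callable: item index -> bool (one model query)
    population_size -- number of benchmark items N
    budget          -- number of items n we can afford to query (2 <= n <= N)
    z               -- normal quantile (1.96 gives a nominal 95% interval)
    Returns (p_hat, lower, upper).
    """
    if not 2 <= budget <= population_size:
        raise ValueError("budget must satisfy 2 <= budget <= population_size")

    rng = random.Random(seed)
    sampled = rng.sample(range(population_size), budget)  # without replacement

    hits = sum(1 for i in sampled if is_correct(i))
    p_hat = hits / budget

    # Unbiased variance estimate of the sample mean under simple random
    # sampling without replacement: (1 - n/N) * s^2 / n, where
    # s^2 = n/(n-1) * p_hat * (1 - p_hat) for 0/1 outcomes.
    fpc = 1.0 - budget / population_size
    variance = fpc * p_hat * (1.0 - p_hat) / (budget - 1)

    half_width = z * math.sqrt(variance)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


if __name__ == "__main__":
    # Synthetic stand-in for a benchmark: 1000 items, the first 700 answered correctly.
    labels = [i < 700 for i in range(1000)]
    print(srs_accuracy_ci(lambda i: labels[i], len(labels), budget=200))
```

The finite-population correction `1 - n/N` is what separates this from the usual infinite-population interval: as the budget approaches the benchmark size the interval shrinks to a point, since nothing is left unobserved.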