Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

arXiv:2605.11209v1 While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world deployment, however, extremely high reliability is critical: the gap between "five-nines" (99.999%) and "three-nines" (99.9%) corresponds to a hundredfold difference in failure rate (1 failure per 100,000 queries versus 1 per 1,000), which can be catastrophic in reliability-critical applications.
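
To make the reliability gap concrete, the following minimal sketch computes the expected number of failures at a given reliability level, and the number of consecutive failure-free test samples needed before a given failure rate can be ruled out at a chosen confidence. It assumes independent, identically distributed queries and uses the standard zero-failure binomial bound. This is an illustration only, not the estimation method of the paper, and the function names are hypothetical.

```python
import math


def expected_failures(reliability: float, n_queries: int) -> float:
    """Expected number of failures over n_queries at the given reliability."""
    return (1.0 - reliability) * n_queries


def zero_failure_sample_size(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that observing zero failures in n i.i.d. trials
    rules out a failure rate >= max_failure_rate at the given confidence.

    Solves (1 - p)^n <= 1 - confidence for n (assumption: i.i.d. trials).
    """
    if not 0.0 < max_failure_rate < 1.0:
        raise ValueError("max_failure_rate must be in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # log1p(-p) is more accurate than log(1 - p) for very small p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-max_failure_rate))


if __name__ == "__main__":
    for label, reliability in [("three-nines", 0.999), ("five-nines", 0.99999)]:
        failures = expected_failures(reliability, 1_000_000)
        n_needed = zero_failure_sample_size(1.0 - reliability)
        print(f"{label}: ~{failures:.0f} failures per million queries, "
              f"{n_needed} failure-free samples needed at 95% confidence")
```

Under these assumptions, certifying five-nines by naive sampling requires on the order of hundreds of thousands of failure-free samples, compared with a few thousand for three-nines, which is why sample efficiency matters in the saturated regime.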