Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

arXiv:2603.29231v1 Announce Type: new Existing benchmarks measure capability (whether a model succeeds on a single attempt), but production deployments require reliability: consistent success across repeated attempts on tasks of varying duration. We show that these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. The framework's metrics include RDC, Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point.
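The gap between single-attempt capability and repeated-attempt reliability can be illustrated with a toy model. The sketch below is not the paper's method and does not implement any of its metrics. It assumes, purely for illustration, that a task is a chain of independent steps with a fixed per-step success probability, and that repeated attempts are independent. All function names and parameters are hypothetical.

```python
def single_attempt_success(step_success: float, num_steps: int) -> float:
    """Probability that one attempt completes a task of `num_steps` steps.

    Assumption (not from the paper): steps succeed independently, each
    with probability `step_success`.
    """
    return step_success ** num_steps


def all_attempts_success(step_success: float, num_steps: int, attempts: int) -> float:
    """Probability that every one of `attempts` independent attempts succeeds."""
    return single_attempt_success(step_success, num_steps) ** attempts


def reliability_ratio(step_success: float, num_steps: int, attempts: int) -> float:
    """Share of single-attempt success that survives repetition.

    Equals all-attempts success divided by single-attempt success; a value
    near 1.0 means single-attempt success is a good proxy for consistency.
    """
    single = single_attempt_success(step_success, num_steps)
    if single == 0.0:
        return 0.0
    return all_attempts_success(step_success, num_steps, attempts) / single


if __name__ == "__main__":
    # Hypothetical values: 99% per-step success, 5 repeated attempts.
    for steps in (1, 10, 100):
        single = single_attempt_success(0.99, steps)
        ratio = reliability_ratio(0.99, steps, attempts=5)
        print(f"steps={steps:>3}  single-attempt={single:.3f}  ratio={ratio:.3f}")
```

Under these assumptions the ratio falls as the number of steps grows, so a single-attempt score measured on short tasks says little about consistency on long ones. Real agents need not satisfy the independence assumption, so this shows only the direction of the effect.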