On Randomness in Agentic Evals

arXiv:2602.07150v2 Announce Type: replace-cross Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass score computed from a single run per task, assuming that this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0.
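
To make the notion of run-to-run spread concrete, the following is a minimal sketch of how the spread of single-run pass rates could be computed from a matrix of per-run, per-task outcomes. It is not the authors' analysis code. The function names (`single_run_pass_rates`, `summarize`, `simulate`), the synthetic per-task solve probabilities, and the choice of 500 tasks and 10 runs are all assumptions made for illustration, and the simulated numbers say nothing about the paper's results.

```python
import random
import statistics


def single_run_pass_rates(outcomes):
    """Return one pass rate per run.

    outcomes[r][t] is True if run r solved task t (hypothetical layout).
    """
    return [sum(run) / len(run) for run in outcomes]


def summarize(outcomes):
    """Mean, sample standard deviation, and max-min range of per-run pass rates."""
    rates = single_run_pass_rates(outcomes)
    return {
        "mean": statistics.mean(rates),
        "std": statistics.stdev(rates),      # needs at least two runs
        "range": max(rates) - min(rates),    # spread from picking one run
    }


def simulate(num_tasks=500, num_runs=10, seed=0):
    """Generate synthetic outcomes; purely illustrative, not real benchmark data."""
    rng = random.Random(seed)
    # Assumption: each task has a fixed solve probability drawn uniformly.
    probs = [rng.random() for _ in range(num_tasks)]
    return [[rng.random() < p for p in probs] for _ in range(num_runs)]


if __name__ == "__main__":
    stats = summarize(simulate())
    # Report in percentage points, the unit used in the abstract.
    print(f"mean pass rate: {100 * stats['mean']:.1f}%")
    print(f"std across runs: {100 * stats['std']:.2f} pp")
    print(f"range across runs: {100 * stats['range']:.2f} pp")
```

Under these assumptions, the `range` value corresponds to how far a reported score could move depending on which single run is selected, and `std` to the run-to-run standard deviation.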