Quantifying infrastructure noise in agentic coding evals (12 minute read)

TLDR AI
Generative AI

Agentic coding benchmarks are commonly used to compare the software engineering capabilities of frontier models, and their scores are often treated as precise measurements of relative capability. However, research shows that infrastructure configuration alone can produce significant differences in scores. Eval developers have begun to account for this, but current fixes may change what the benchmarks actually measure.