SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

arXiv:2603.24755v1 Announce Type: cross Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite yet become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but they constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We