RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

arXiv:2605.15846v1 Announce Type: cross Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale.