AI Agents Can Pass Tests. They Still Can't Maintain Systems.

AI coding tools have made writing software dramatically easier. Cursor can scaffold entire modules from a prompt. Claude can produce working CLIs in minutes. Copilot fills in entire implementations as you type.

Code production is no longer the primary bottleneck in software engineering. But a new research benchmark from Alibaba highlights something important: writing code is not the same as maintaining a system.

The SWE-CI Benchmark

Most AI coding benchmarks test one thing: can a model produce working code once? Examples include HumanEval, MBPP, and SWE-bench.