AutoBe benchmark: structured harness narrows frontier-vs-local gap in backend generation [D]

AutoBe is a benchmark for end-to-end backend generation. One natural language request produces six outputs: requirements analysis, ERD, OpenAPI spec, E2E tests, NestJS implementation, and a type-safe SDK. Each phase fills a predefined AST via structured function calling rather than generating unstructured code. The scoring rubric is 100 points driven entirely by static analysis - the same artifact scores the same regardless of who reruns it. The headline finding is that scores cluster tightly. GLM 5 tops the benchmark run. qwen3.5-27b sits directly behind frontier models.