DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

arXiv:2601.11895v2 Announce Type: replace

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories, such as API usage and code purpose understanding, derived from real developer telemetry. Unlike prior benchmarks, it emphasizes ecological validity, avoids