Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

arXiv:2510.24358v2 Announce Type: replace-cross

Recent advances in code agents, powered by large language models (LLMs), have enabled automated software development at the project level. However, existing benchmarks for evaluating code agents face two major limitations. First, creating high-quality project-level evaluation datasets requires extensive domain expertise, which leads to prohibitive annotation costs and limited diversity.