EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

arXiv:2604.00392v1 Announce Type: cross Modern LLM agents increasingly create their own tools at runtime -- from Python functions to API clients -- yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We