AI RESEARCH

Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

arXiv CS.AI

arXiv:2604.00594v1 Announce Type: new

As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents, and why, becomes increasingly difficult. This is compounded by current practice: agent performance is typically measured by aggregate pass rates on benchmarks, but single-number metrics obscure the diversity of tasks within a benchmark. We present a framework, tailored to the agentic coding regime, for predicting success or failure on individual tasks.