An Executable Benchmarking Suite for Tool-Using Agents

arXiv:2605.11030v1 Announce Type: cross

Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract.