Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

arXiv:2604.06132v1 Announce Type: new

Large language models are increasingly deployed as autonomous agents that execute multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow coverage of modalities and interaction paradigms. We