Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

arXiv:2510.02837v2 Announce Type: replace-cross Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical aspects of the trajectory such as efficiency, hallucination, and adaptivity. The most straightforward evaluation method is to compare an agent's trajectory with the ground truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we