TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

arXiv:2605.09544v1 Announce Type: new Tool-integrated reasoning (TIR) has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality, unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we