From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

arXiv:2605.15104v1

Voice agents increasingly need to use tools reliably from speech, yet prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool-calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations.
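
The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name a TTS engine, a noise source, or signal-to-noise levels, so `placeholder_tts` (a synthetic tone standing in for real speech synthesis), the Gaussian noise, the `TextInstance` and `AudioInstance` types, and the `snr_db` parameter are all assumptions made for the example. The point it shows is that the audio instance keeps a reference to the unchanged text instance, so the tool schema and gold label need no re-annotation.

```python
import math
import random
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class TextInstance:
    """One item from a text benchmark (hypothetical field names)."""
    query: str
    tool_schema: Dict[str, Any]
    gold_call: Dict[str, Any]


@dataclass(frozen=True)
class AudioInstance:
    """Audio rendering paired with the untouched text instance."""
    text: TextInstance
    speaker: str
    snr_db: float
    waveform: List[float]


def placeholder_tts(text: str, speaker: str, sample_rate: int = 16000) -> List[float]:
    """Stand-in for a real TTS engine (assumption): a tone whose pitch depends on the speaker."""
    pitch = 120.0 + 40.0 * (sum(map(ord, speaker)) % 5)
    n = (sample_rate // 100) * max(1, len(text))  # 10 ms of audio per character
    return [0.5 * math.sin(2 * math.pi * pitch * i / sample_rate) for i in range(n)]


def power(signal: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale the noise so that speech power / noise power equals snr_db, then add it."""
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def to_audio(inst: TextInstance, speaker: str, snr_db: float, seed: int = 0) -> AudioInstance:
    """Render one text instance as noisy audio; the annotations are carried over as-is."""
    rng = random.Random(seed)  # seeded so the rendering is reproducible
    speech = placeholder_tts(inst.query, speaker)
    noise = [rng.gauss(0.0, 1.0) for _ in speech]  # Gaussian noise as a stand-in
    return AudioInstance(inst, speaker, snr_db, mix_at_snr(speech, noise, snr_db))


if __name__ == "__main__":
    item = TextInstance(
        query="What is the weather in Paris?",
        tool_schema={"name": "get_weather", "parameters": {"city": "string"}},
        gold_call={"name": "get_weather", "arguments": {"city": "Paris"}},
    )
    # Vary speaker and noise level while the gold label stays fixed.
    for speaker in ("speaker_a", "speaker_b"):
        for snr in (20.0, 5.0):
            audio = to_audio(item, speaker, snr)
            print(speaker, snr, len(audio.waveform), audio.text.gold_call["name"])
```

Seeding the noise generator and keeping the speaker and noise level as explicit fields is one way to make each paired instance reproducible; a real pipeline would replace the tone and Gaussian noise with synthesized speech and recorded environmental noise.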