Quantitative Certification of Agentic Tool Selection

ArXi:2510.03992v2 Announce Type: replace-cross Large language models (LLMs) are increasingly deployed in agentic systems, where a fundamental task is mapping user intents to relevant external tools. Errors in tool selection can have severe outcomes, such as unauthorized data access, even without modifying the agent's underlying model. Existing evaluations measure performance on curated, benign benchmarks. However, a pipeline's behavior in deployment depends on the tool pool the agent actually encounters, which in open registries is shaped by third parties. We