How do you objectively tell if your custom agent tools are actually better?

I've been running Qwen3.6-35B-A3B locally in pi agent and hit a cat-spam problem. The agent ignores the read tool, and the model gets stuck reading the same file 3-4 times with cat, or dumps entire 2k-line logs instead of grepping. I wrote a custom tool as a replacement, and it feels like it helped: the agent makes fewer calls, doesn't blindly re-read the same file, and tasks seem to finish faster. But I have no objective way to know whether it's actually better. Maybe I'm just cherry-picking the tasks where it works.
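To make the question concrete, here is a rough sketch of the kind of measurement I have in mind, not something I've validated. It assumes I fix a task list up front (so I can't cherry-pick), run every task once with the stock tools and once with my custom tool, and log each tool call as a `(tool, path)` tuple. That log format and the function names `redundant_reads`, `sign_test_p`, and `compare` are all hypothetical, nothing here comes from pi agent itself. It counts repeated reads of the same file and runs a paired sign test on total calls per task.

```python
from collections import Counter
from math import comb

# Assumed (hypothetical) log format: one list of (tool, path) tuples per task run.
READ_TOOLS = {"cat", "read"}

def redundant_reads(calls):
    """Count reads of a path beyond the first one within a single run."""
    counts = Counter(path for tool, path in calls if tool in READ_TOOLS)
    return sum(n - 1 for n in counts.values())

def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value; ties are assumed already dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def compare(baseline, custom):
    """Paired comparison of total tool calls on the same task ids.

    Both arguments map task_id -> list of (tool, path) tuples.
    A "win" is a task where the custom tool run used fewer calls.
    """
    wins = losses = 0
    for task_id, base_calls in baseline.items():
        b, c = len(base_calls), len(custom[task_id])
        if c < b:
            wins += 1
        elif c > b:
            losses += 1
    return wins, losses, sign_test_p(wins, losses)
```

Call count alone wouldn't be enough, since a run can use fewer calls and still fail the task, so I'd want to record task success next to it. With a local model I'd probably also need several runs per task, because a single run per condition is noisy.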