[D] Why evaluating only final outputs is misleading for local LLM agents

I’ve been running local agents with Ollama + LangChain lately and noticed something kind of uncomfortable: you can get a completely correct final answer while the agent is doing absolute nonsense internally. I’m talking about stuff like calling the wrong tool first and then “recovering,” using tools it didn’t need at all, looping a few times before converging, or even getting dangerously close to calling something it shouldn’t. And if you’re only checking the final output, all of that just… passes. It made me realize that for agents, the final output is almost the least interesting thing to evaluate.
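To make that concrete, here's a rough sketch of what checking the trajectory (instead of just the answer) could look like. It's plain Python and doesn't touch LangChain or Ollama at all: it assumes you've already dumped the agent's tool calls into a list of `(tool_name, args)` tuples, however your setup exposes them. `check_trace`, `TraceReport`, and the tool names are all made up for illustration. Also note it only catches calls that actually happened, so the "almost called something it shouldn't" case would need you to log rejected or proposed actions too.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TraceReport:
    """Hypothetical container for trajectory-level findings."""
    wrong_first_tool: bool = False
    unneeded_tools: list = field(default_factory=list)
    repeated_calls: dict = field(default_factory=dict)
    forbidden_calls: list = field(default_factory=list)

    @property
    def clean(self):
        # True only if none of the four checks flagged anything.
        return not (self.wrong_first_tool or self.unneeded_tools
                    or self.repeated_calls or self.forbidden_calls)


def check_trace(calls, expected_first, allowed, forbidden, max_repeats=1):
    """Inspect a list of (tool_name, args) tuples, in call order.

    expected_first: tool the task should start with (an assumption you supply)
    allowed:        set of tools the task actually needs
    forbidden:      set of tools that must never be called
    max_repeats:    identical calls beyond this count are treated as looping
    """
    report = TraceReport()
    names = [name for name, _ in calls]

    # 1. Wrong tool first, then "recovering".
    if names and names[0] != expected_first:
        report.wrong_first_tool = True

    # 2. Tools the task did not need (forbidden ones are reported separately).
    for name in names:
        if (name not in allowed and name not in forbidden
                and name not in report.unneeded_tools):
            report.unneeded_tools.append(name)

    # 3. Looping: the exact same call issued more than max_repeats times.
    counts = Counter(calls)
    report.repeated_calls = {c: n for c, n in counts.items() if n > max_repeats}

    # 4. Calls to tools that should never run.
    report.forbidden_calls = [name for name in names if name in forbidden]
    return report


# Made-up trace: the final answer could still be correct here.
trace = [
    ("web_search", "weather Berlin"),
    ("get_weather", "Berlin"),
    ("get_weather", "Berlin"),
    ("calculator", "1+1"),
]
report = check_trace(
    trace,
    expected_first="get_weather",
    allowed={"get_weather"},
    forbidden={"delete_file"},
)
print(report)
print("clean trajectory:", report.clean)
```

An output-only eval would pass this run, while the trace check flags the wrong first tool, two unneeded tools, and a repeated call. The thresholds and the "expected first tool" idea are judgment calls; a real task may allow several valid orderings.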