How are people actually debugging bad outputs in agent / RAG pipelines?

Been messing around with some agent / RAG pipelines and running into cases where everything executes fine (tool calls return expected outputs, parsing works, etc.) but the final answer is still wrong or slightly off. Nothing crashes, just bad outputs.

Curious how people are actually debugging this in practice. Are you using evals? Tracing tools (LangSmith etc.)? Stepping through logs manually? Or just accepting some % of bad outputs?

Feels like there are a lot of cases where nothing technically fails but the output is still wrong.

submitted by /u/YouSlow6554