How do you test your LLM agents before shipping changes?

Genuinely curious how other engineers are handling this. Every time I change a prompt, swap a model, or tweak a tool, I struggle to get a reliable answer to a simple question: did the agent get better or worse? The problem I keep hitting is that aggregate metrics (average success rate, total tokens) usually look fine while specific task types silently break. The easy tasks improve and mask the regressions on the hard ones. By the time someone notices, the regression is already in production.
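
To make the failure mode concrete, here is a minimal sketch in Python of how an overall success rate can go up while one task type regresses. Everything in it is an assumption for illustration: the function names (`success_by_type`, `overall_rate`, `find_regressions`), the `(task_type, passed)` record shape, the 0.05 tolerance, and the numbers themselves, which are made up rather than taken from a real eval run.

```python
from collections import defaultdict


def success_by_type(results):
    """Return {task_type: success_rate} for a list of (task_type, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task_type, passed in results:
        totals[task_type] += 1
        passes[task_type] += int(passed)
    return {t: passes[t] / totals[t] for t in totals}


def overall_rate(results):
    """Aggregate success rate across all tasks, ignoring task type."""
    return sum(int(passed) for _, passed in results) / len(results)


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return task types whose success rate dropped by more than `tolerance`.

    The tolerance is a hypothetical threshold; a real setup would likely
    want something that accounts for sample size and run-to-run noise.
    """
    base = success_by_type(baseline)
    cand = success_by_type(candidate)
    return {
        t: (base[t], cand[t])
        for t in base
        if t in cand and base[t] - cand[t] > tolerance
    }


# Made-up eval results: 100 easy tasks and 20 hard tasks per run.
baseline = (
    [("easy", True)] * 70 + [("easy", False)] * 30
    + [("hard", True)] * 12 + [("hard", False)] * 8
)
candidate = (
    [("easy", True)] * 80 + [("easy", False)] * 20
    + [("hard", True)] * 6 + [("hard", False)] * 14
)

if __name__ == "__main__":
    print("overall baseline: ", overall_rate(baseline))
    print("overall candidate:", overall_rate(candidate))
    print("regressed types:  ", find_regressions(baseline, candidate))
```

In this toy data the aggregate rate rises because the easy bucket is five times larger than the hard one, so a per-type breakdown is the only place the drop on hard tasks shows up.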