How to diagnose where your RAG agent fabricates: an open-source A/B eval workflow with cross-lab blind judges

TL;DR: I caught my own RAG agent telling a customer with a severe nut allergy which dishes were "safe", based on a menu with no allergen tagging. The pattern is universal: when retrieval can't fully answer a question, the agent pattern-matches a plausible answer instead of admitting the gap. I built an open-source eval workflow that diagnoses this in any RAG agent. The workflow pairs two identical agent producers, only one of which has a runtime tool wired in, with four blind judges from four different labs, a deterministic aggregator, and a synthesizer agent. Repo at the end.
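
To make the "blind judges plus deterministic aggregator" idea concrete, here is a minimal Python sketch. It is not the code from the repo: the function names (`blind_labels`, `aggregate`), the variant names (`baseline`, `with_tool`), and the vote format (`"A"`, `"B"`, or `"tie"`) are all assumptions chosen for illustration. It shows one way to hide which producer wrote which answer and then tally the judges' votes with no randomness or model calls.

```python
import hashlib
from collections import Counter


def blind_labels(question_id: str) -> dict[str, str]:
    """Map blind labels "A"/"B" to producer variants for one question.

    The order is decided by a hash of the question id, so it varies
    across questions but is identical on every run (no RNG state).
    Variant names are hypothetical, not taken from the repo.
    """
    flip = hashlib.sha256(question_id.encode("utf-8")).digest()[0] % 2
    variants = ["baseline", "with_tool"]
    if flip:
        variants.reverse()
    return {"A": variants[0], "B": variants[1]}


def aggregate(question_id: str, votes: dict[str, str]) -> dict:
    """Un-blind the judges' votes and pick a winner by simple plurality.

    `votes` maps a judge name to "A", "B", or "tie" (assumed format).
    Anything that is not a known label is counted as a tie.
    """
    mapping = blind_labels(question_id)
    tally = Counter(mapping.get(vote, "tie") for vote in votes.values())
    baseline, with_tool = tally["baseline"], tally["with_tool"]
    if with_tool > baseline:
        winner = "with_tool"
    elif baseline > with_tool:
        winner = "baseline"
    else:
        winner = "tie"
    return {"winner": winner, "tally": dict(sorted(tally.items()))}
```

Judges only ever see the labels "A" and "B"; the mapping back to producers happens inside `aggregate`. Because the label order comes from a hash and the tally is a plain count, the same votes always produce the same result, which is the property "deterministic" is meant to convey here.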