Building a Production LLM Evaluation Harness in Pytest: Cost-Bounded, Flake-Aware, CI-Gated (Runnable Python)

I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression reached a customer. The tests were green, but for the wrong reason: every assertion was a single LLM call compared against a single golden answer, on a model whose sampled output happened to land in our favor that day. We had built a coin flip and called it a test. This article is the harness I wish I'd had on day one.
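
To put a number on the coin flip, here is a minimal sketch (not the harness itself). It assumes each LLM call passes a given check independently with some probability `p`, and compares how often a single-sample assertion goes green against a rule that requires at least `k` of `n` samples to pass. The function name `prob_at_least_k_of_n`, the 60% pass rate, and the 9-of-10 threshold are illustrative assumptions, not measurements from the incident above.

```python
from math import comb


def prob_at_least_k_of_n(p: float, k: int, n: int) -> float:
    """Probability that at least k of n independent samples pass.

    Assumes every sample passes with the same probability p (a binomial
    model). Real LLM calls may be correlated, so treat this as a rough guide.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Sum the binomial probabilities for exactly i passes, i = k..n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical regressed behavior: the check now passes only 60% of the time.
p_regressed = 0.6

single_call_green = prob_at_least_k_of_n(p_regressed, k=1, n=1)
nine_of_ten_green = prob_at_least_k_of_n(p_regressed, k=9, n=10)

print(f"single call goes green:  {single_call_green:.3f}")
print(f"9-of-10 rule goes green: {nine_of_ten_green:.3f}")
```

Under these assumptions the single-call test stays green 60% of the time despite the regression, while the 9-of-10 rule stays green only about 4.6% of the time. The price is ten calls per check instead of one, which is why cost has to be bounded alongside flakiness.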