The Eval Trap: Your Benchmark Is Part of Your Product

AI evals are becoming more common and more necessary, but a poorly designed benchmark will not reveal how a system behaves in production, and it can give you a false sense of stability along the way. Over the past few months I ran 220,000 evaluations of LLM-based phishing detection across 11 frontier models and 10 system prompt configurations. Below is the specific failure I ran into and the steps I took to mitigate it.