The Eval Trap: Your Benchmark Is Part of Your Product
Towards AI
AI evals are becoming increasingly necessary and common, but a poorly designed benchmark will not reveal how the system behaves in production, and it will give you a false sense of stability along the way. Over the past few months I ran 220,000 evaluations of LLM-based phishing detection across 11 frontier models and 10 system prompt configurations. Below is the specific failure I ran into during that work and the steps I took to mitigate it.