Before Real Users Break Your ML System, Let Synthetic Data Do It First

We spent six weeks building a recommendation model that worked beautifully in offline evaluation. Precision at K was strong. NDCG looked clean. Every metric we tracked in the notebook environment told us we were ready. We deployed to a staging environment, ran a smoke test with twenty synthetic users, confirmed predictions were returning correctly, and scheduled the production rollout for the following Monday. By Monday afternoon, the serving layer was timing out under real traffic. The model itself was fine. The issue was in the feature retrieval pipeline.