How Our AI Inference Cluster Almost Lost a $20K/Month Client

What a GPU memory leak and missing circuit breakers taught us about production AI infrastructure.

I’m going to tell you about the worst fifteen minutes of my professional life first. Then I’m going to show you exactly what I built so it never happens again.

The demo was for a room full of enterprise stakeholders. One of those moments where six months of work gets judged in thirty minutes. Our multi-model pipeline (three models running in sequence on Triton, orchestrated through Ray Serve, sitting on EKS) had been rock solid in staging for two weeks. We’d run load tests. We’d done dry runs.