Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

This is Part 1 of a practical series on hosting large LLMs on Kubernetes.

Most platform engineers already know how to scale a web app. Put it in a container. Deploy it on Kubernetes. Set CPU and memory requests. Put a Service or Ingress in front. Configure the Horizontal Pod Autoscaler (HPA). Watch p95 latency, error rate, CPU, memory, and request throughput. Add replicas when traffic goes up.

That playbook works for a lot of services. Then you try to serve a large language model, and suddenly the old mental model starts cracking. Memory does not just mean RAM. Latency is not one number.
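To make the familiar playbook concrete, here is a minimal Python sketch of the replica calculation the Kubernetes HPA documents: scale the current replica count by the ratio of the observed metric to its target, round up, and skip the change when the ratio is within a tolerance. The function name `desired_replicas` and its parameters are illustrative, not part of any Kubernetes API, and the real controller also handles stabilization windows, missing metrics, and min/max bounds, which this sketch leaves out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,  # assumed to match the HPA's documented default tolerance
) -> int:
    """Sketch of the HPA rule: ceil(current_replicas * current_metric / target_metric)."""
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("current_replicas and target_metric must be positive")

    ratio = current_metric / target_metric

    # Within the tolerance band, keep the current replica count to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Round up so capacity is never under-provisioned; never go below one replica.
    return max(1, math.ceil(current_replicas * ratio))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

The rule works well when a single metric, such as average CPU utilization, tracks load closely, which is the assumption behind "add replicas when traffic goes up."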