Inside LLM Inference: When the KV Cache No Longer Fits

Why managing the KV cache is a systems problem, not a model problem

In the previous article, we reduced the size of the KV cache. Techniques like MQA and GQA reduce how much we store per token. But reduction only delays the problem. It does not solve it.

That analysis, however, focused on a single request. Real systems do not serve one request at a time.

Think about what actually happens when a product like ChatGPT serves users. At any given moment, hundreds of conversations are running on the same GPU. Each with its own prompt. Each with its own conversation history.
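
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption chosen for illustration and does not describe any specific model or deployment: 32 layers, a head dimension of 128, fp16 values (2 bytes each), 4096 tokens per conversation, and 200 concurrent conversations. The helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored for one token of one request."""
    # The leading 2 counts one key vector and one value vector
    # per KV head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(num_requests, tokens_per_request, **model):
    """Total KV cache bytes if every request holds the same number of tokens."""
    return num_requests * tokens_per_request * kv_bytes_per_token(**model)


# Hypothetical model shapes (assumptions, not a real model's config).
mha = dict(num_layers=32, num_kv_heads=32, head_dim=128)  # one KV head per query head
gqa = dict(num_layers=32, num_kv_heads=8, head_dim=128)   # 4 query heads share each KV head

GIB = 2**30

if __name__ == "__main__":
    for name, model in [("MHA", mha), ("GQA", gqa)]:
        one = kv_cache_bytes(1, 4096, **model) / GIB
        many = kv_cache_bytes(200, 4096, **model) / GIB
        print(f"{name}: {one:.1f} GiB for 1 request, {many:.1f} GiB for 200 requests")
```

Under these assumptions, GQA cuts the per-token cost by 4x, from 512 KiB to 128 KiB. A single 4096-token conversation then needs 0.5 GiB, which looks comfortable. Two hundred of them need 100 GiB, which is more than an 80 GB GPU holds even before the model weights are loaded. The reduction helped, but the multiplication by concurrent requests brings the problem right back.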