Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck

Towards AI

How LLMs reuse history, read memory, and transform computation into data movement

Figure: KV cache cuts latency

In the previous article, we saw that LLM inference slows down as context grows. This is not just because there is more computation, but because the system itself shifts from being compute-bound to memory-bound.

That raises a natural question. If memory access is now the bottleneck, what does that actually look like inside the model? What is the system doing at each step, and why does the cost keep growing?

From the outside, this process looks simple. You send a prompt.