Exploring how KV cache architecture has evolved - model architectures that are selective about what to remember help avoid context rot

I went deep on KV cache recently and found the progression across architectures fascinating once you look at the actual numbers side by side. Sebastian Raschka's LLM Architecture Gallery has per-token KV cache costs for dozens of model families. The trajectory: • GPT-2: 300 KiB/token. Multi-head attention, every head maintains its own keys and values. No sharing. A 4,000-token conversation = ~1.2 GB of GPU memory just for the cache, separate from the model weights. • Llama 3: 128 KiB/token. Grouped-query attention, where multiple query heads share the same KV pairs.