RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

arXiv:2505.02922v3 Announce Type: replace Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags behind because of growing GPU memory and bandwidth demands. The cause is the key-value (KV) cache, an intermediate structure that stores token representations: it grows linearly with context length, and attention computation requires a linear scan over it at every decoding step.
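
To make the cost described above concrete, the following is a minimal pure-Python sketch of a single-head KV cache under standard softmax attention. It illustrates the baseline problem only and is not RetroInfer's method. The names `attend`, `KVCache`, and `kv_cache_bytes`, and the default of 2 bytes per element, are assumptions introduced for illustration.

```python
import math


def attend(query, keys, values):
    """Softmax attention for one query; touches every cached key and value."""
    scale = 1.0 / math.sqrt(len(query))
    # One dot product per cached token: this is the linear scan.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


class KVCache:
    """Toy single-head cache holding one (key, value) pair per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def decode_step(self, query, key, value):
        # The cache grows by one entry per generated token ...
        self.keys.append(key)
        self.values.append(value)
        # ... and each step rescans the whole cache.
        return attend(query, self.keys, self.values)


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Cache footprint: keys plus values for every layer, head, and token.

    All parameters are hypothetical model dimensions, not figures from the
    paper. The result is linear in num_tokens.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens
```

Under these assumptions, doubling the context length doubles both the memory returned by `kv_cache_bytes` and the number of dot products each `decode_step` performs, which is the memory and bandwidth pressure the abstract refers to.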