AI RESEARCH
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
arXiv cs.LG
arXiv:2605.05219v1 Announce Type: new Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a single stored state rather than requiring the entire token history. This asymmetry opens a new design point between no reuse and dense caching: store exact recurrent states at a sparse set of checkpoint positions and, on a cache hit, resume from the deepest stored checkpoint and recompute the remaining suffix exactly.
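The checkpoint-and-resume idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not taken from the paper: the integer recurrence `step` stands in for a real recurrent or state-space layer, checkpoints are placed at a fixed `stride` (the abstract says only that the positions are sparse), and the cache is keyed by the raw token prefix rather than a hash. The names `SparsePrefixCache`, `run`, and `full_recompute` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

MOD = 1_000_003  # toy modulus; a real layer would carry a tensor state


def step(state: int, token: int) -> int:
    """Toy recurrent update (hypothetical stand-in for an SSM layer)."""
    return (31 * state + token + 1) % MOD


def full_recompute(tokens: Sequence[int]) -> int:
    """Baseline with no reuse: fold every token into the state."""
    state = 0
    for token in tokens:
        state = step(state, token)
    return state


class SparsePrefixCache:
    """Stores exact states only at every `stride`-th token position."""

    def __init__(self, stride: int) -> None:
        if stride < 1:
            raise ValueError("stride must be >= 1")
        self.stride = stride
        # Maps a token prefix to the exact state after consuming it.
        self.checkpoints: Dict[Tuple[int, ...], int] = {}

    def run(self, tokens: Sequence[int]) -> Tuple[int, int]:
        """Return (final_state, number_of_tokens_recomputed)."""
        toks = tuple(tokens)
        start, state = 0, 0
        # Search checkpoint positions from deepest to shallowest.
        pos = (len(toks) // self.stride) * self.stride
        while pos > 0:
            cached = self.checkpoints.get(toks[:pos])
            if cached is not None:
                start, state = pos, cached
                break
            pos -= self.stride
        # Recompute the remaining suffix exactly, adding new checkpoints.
        for i in range(start, len(toks)):
            state = step(state, toks[i])
            if (i + 1) % self.stride == 0:
                self.checkpoints[toks[: i + 1]] = state
        return state, len(toks) - start
```

With `stride=4`, a first request of 10 tokens is computed in full and leaves checkpoints at positions 4 and 8. A second request sharing the first 9 tokens resumes from position 8 and recomputes only the tokens after it, and its final state matches `full_recompute` because the stored state is exact. A larger stride stores fewer states at the cost of a longer recomputed suffix.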