KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

arXiv CS.AI

arXiv:2605.09735v1 Announce Type: cross

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling. Static-graph executors, by contrast, often over-reserve memory and suffer latency outliers during request bursts. This paper studies whether much of this variability can be absorbed below a fixed decode interface.
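To make the contrast between over-reservation and paged KV management concrete, here is a minimal sketch. It assumes a generic block-table allocator of the kind dynamic runtimes use and is not KV-RM's method. The class PagedKVAllocator, its methods, and the helper static_reserved_tokens are hypothetical names chosen for illustration. The static baseline reserves max_len positions per batch slot. The paged allocator hands out fixed-size blocks on demand, returns them to a free list on EOS, and maps each logical token position to a possibly non-contiguous physical block. That last property is why logical histories fragment over time.

```python
from typing import Dict, List, Tuple


class PagedKVAllocator:
    """Hypothetical paged KV allocator: each request owns a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}  # request id -> physical blocks
        self.lengths: Dict[int, int] = {}             # request id -> tokens written

    def append_token(self, req_id: int) -> None:
        # Grow a request by one token; take a new block only when the last one is full.
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("out of KV blocks")
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: int) -> None:
        # On EOS, return all of the request's blocks to the free list immediately.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

    def physical_slot(self, req_id: int, pos: int) -> Tuple[int, int]:
        # Translate a logical token position into (physical block id, offset in block).
        return self.block_tables[req_id][pos // self.block_size], pos % self.block_size

    def reserved_tokens(self) -> int:
        # KV positions currently held, counted in whole blocks.
        return sum(len(t) for t in self.block_tables.values()) * self.block_size


def static_reserved_tokens(num_slots: int, max_len: int) -> int:
    # Fixed-shape decode buffers reserve max_len positions per batch slot.
    return num_slots * max_len


lengths = {0: 37, 1: 512, 2: 5, 3: 200}  # illustrative request lengths
alloc = PagedKVAllocator(num_blocks=64, block_size=16)
for req_id, n in lengths.items():
    for _ in range(n):
        alloc.append_token(req_id)

print(static_reserved_tokens(num_slots=4, max_len=2048))  # 8192
print(alloc.reserved_tokens())                            # 784
alloc.release(1)  # request 1 reaches EOS
print(alloc.reserved_tokens())                            # 272
```

With these made-up lengths, the static layout holds 8192 positions while the paged layout holds 784, and the paged layout shrinks as soon as a request finishes. The cost is an extra indirection through the block table on every KV access. That indirection sits awkwardly with fixed tensor shapes, which is the tension the abstract describes.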