AI RESEARCH

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

arXiv CS.CL

arXiv:2605.18071v1 Announce Type: new

Serving long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency.
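
As a rough illustration of why transfer volume scales with context length and batch size, the sketch below computes the size of a KV cache and the bytes fetched per decoding step when only a fraction of entries is selected. It uses the standard sizing formula (keys plus values, per layer, per KV head, per token). The model dimensions, the `keep_ratio` parameter, and both function names are hypothetical and chosen for illustration; none of them come from the paper.

```python
import math


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Total KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def fetched_bytes_per_step(num_layers, num_kv_heads, head_dim, seq_len,
                           batch_size, keep_ratio, bytes_per_elem=2):
    """Bytes moved from host to GPU in one decoding step, assuming a
    hypothetical policy that fetches a fixed fraction (keep_ratio) of the
    cached tokens for every sequence in the batch."""
    fetched_tokens = math.ceil(seq_len * keep_ratio)
    return kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                          fetched_tokens, batch_size, bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    gib = 1024 ** 3
    for seq_len in (32 * 1024, 128 * 1024):
        for batch_size in (1, 8):
            total = kv_cache_bytes(seq_len=seq_len, batch_size=batch_size,
                                   **shape)
            moved = fetched_bytes_per_step(seq_len=seq_len,
                                           batch_size=batch_size,
                                           keep_ratio=0.05, **shape)
            print(f"seq_len={seq_len:>7} batch={batch_size}: "
                  f"cache={total / gib:.2f} GiB, "
                  f"fetched/step={moved / gib:.3f} GiB")
```

Under these assumptions both quantities grow linearly in sequence length and in batch size, so once `keep_ratio` can no longer be lowered without hurting accuracy, the per-step transfer volume grows in step with the workload.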