AI RESEARCH

KV Cache Transform Coding for Compact Storage in LLM Inference

arXiv CS.AI

arXiv:2511.01815v2

Serving large language models (LLMs) at scale requires efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts, which are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage.
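To make the term "transform coder" concrete, here is a minimal sketch of generic transform coding applied to a single vector: apply an orthonormal transform, quantize the coefficients to integers, then invert both steps to reconstruct. The text above does not say which transform or quantizer KVTC uses, so the DCT, the uniform step size, and all function names here are illustrative assumptions, not the paper's method.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (assumed stand-in transform)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def forward(basis, x):
    """Project x onto each basis row to get transform coefficients."""
    return [sum(b * v for b, v in zip(row, x)) for row in basis]

def inverse(basis, coeffs):
    """Invert the orthonormal transform (its inverse is its transpose)."""
    n = len(coeffs)
    return [sum(basis[k][i] * coeffs[k] for k in range(n)) for i in range(n)]

def encode(basis, x, step):
    """Transform, then uniformly quantize coefficients to integer symbols."""
    return [round(c / step) for c in forward(basis, x)]

def decode(basis, symbols, step):
    """Dequantize the integer symbols, then apply the inverse transform."""
    return inverse(basis, [s * step for s in symbols])

# Hypothetical 8-dimensional slice standing in for one cached key or value vector.
basis = dct_matrix(8)
vector = [0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 1.1, 1.0]
symbols = encode(basis, vector, step=0.25)
restored = decode(basis, symbols, step=0.25)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

The integer symbols are what a real coder would go on to store compactly, typically after an entropy-coding stage that this sketch omits. A larger `step` yields smaller symbols at the cost of a larger reconstruction error, which is the basic size-versus-fidelity trade-off in any lossy transform coder.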