AI RESEARCH

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

arXiv cs.AI

arXiv:2605.08317v1 (announce type: cross)

Large language models (LLMs) perform strongly across diverse tasks, but inference over long input contexts is bottlenecked by memory capacity and bandwidth. The key-value (KV) cache grows linearly with sequence length and must be re-read from off-chip high-bandwidth memory (HBM) into on-chip memory at every decoding step, which makes inference memory-bound. Existing methods shrink the cache through either eviction or quantization, but typically treat the two in isolation.
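To make the linear growth concrete, the following is a minimal back-of-the-envelope sketch, not the paper's method. It assumes a standard decoder that stores one key and one value vector per layer, per KV head, per token. The model shape, the HBM bandwidth figure, and the names `ModelShape`, `kv_cache_bytes`, and `min_read_seconds` are hypothetical and chosen only for illustration. Eviction is modeled as keeping a fraction of the tokens and quantization as lowering the bits per stored value, so the two savings multiply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape; the values used below are illustrative only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape, seq_len, bits_per_value=16.0, keep_fraction=1.0, batch_size=1):
    """Estimate KV cache size in bytes.

    keep_fraction models eviction (fraction of tokens retained);
    bits_per_value models quantization (16 = fp16 baseline).
    """
    kept_tokens = seq_len * keep_fraction
    # Factor 2: one key vector and one value vector per layer, head, and token.
    values_per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
    return batch_size * kept_tokens * values_per_token * bits_per_value / 8.0


def min_read_seconds(cache_bytes, hbm_bytes_per_second):
    """Lower bound on per-step latency if the whole cache is read once from HBM."""
    return cache_bytes / hbm_bytes_per_second


if __name__ == "__main__":
    shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)  # assumed shape
    bandwidth = 1e12  # assumed HBM bandwidth: 1 TB/s

    for seq_len in (32_768, 65_536, 131_072):
        full = kv_cache_bytes(shape, seq_len)
        # Assumed combined setting: keep half the tokens, store them at 4 bits.
        small = kv_cache_bytes(shape, seq_len, bits_per_value=4.0, keep_fraction=0.5)
        print(
            f"seq_len={seq_len}: fp16 cache {full / 2**30:.2f} GiB, "
            f"compressed {small / 2**30:.2f} GiB, "
            f"fp16 read bound {min_read_seconds(full, bandwidth) * 1e3:.2f} ms/step"
        )
```

Under these assumptions, doubling the sequence length doubles both the cache size and the per-step read bound, while halving the retained tokens and quartering the bit width together shrink the cache by a factor of eight. The estimate ignores quantization metadata such as scales and zero points, as well as any index structures an eviction scheme might need.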