FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

arXiv:2505.13109v5

Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache, whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a