
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

arXiv cs.LG

arXiv:2510.01290v2 (Announce Type: replace)

Large reasoning models generate long outputs, which enables extended chain-of-thought (CoT) reasoning. The same long outputs also make the key-value (KV) cache grow rapidly, and it quickly overwhelms GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT.
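The two ideas in the abstract can be illustrated with a minimal Python sketch. First, KV cache memory grows linearly with sequence length. Second, a "thought-adaptive" policy can give each reasoning segment a different retention budget depending on how sparse its attention is. Everything specific here is a hypothetical placeholder and not ThinKV's actual method, which this summary does not describe. That includes the model dimensions, the 0.01 sparsity threshold, the cut-offs, the "dense/moderate/sparse" labels, the retention ratios, and the per-token importance scores.

```python
import math

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence: keys + values (factor 2) in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def attention_sparsity(attn_row, threshold=0.01):
    """Fraction of attention weights in one row that fall below `threshold` (assumed value)."""
    return sum(1 for w in attn_row if w < threshold) / len(attn_row)

# Hypothetical buckets and retention ratios: placeholders, NOT ThinKV's thought types.
RETENTION = {"dense": 1.0, "moderate": 0.5, "sparse": 0.1}

def classify_segment(sparsity, low_cut=0.5, high_cut=0.9):
    """Bucket a reasoning segment by how sparse its attention is (assumed cut-offs)."""
    if sparsity < low_cut:
        return "dense"
    if sparsity < high_cut:
        return "moderate"
    return "sparse"

def compress_segment(token_scores, ratio):
    """Indices of the top `ratio` fraction of tokens by score, returned in original order."""
    if not token_scores:
        return []
    k = max(1, math.ceil(ratio * len(token_scores)))  # keep at least one token
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:k])

def thought_adaptive_compress(segments, retention=RETENTION):
    """For each (attention row, per-token scores) segment, keep a label-dependent budget."""
    kept = []
    for attn_row, token_scores in segments:
        label = classify_segment(attention_sparsity(attn_row))
        kept.append((label, compress_segment(token_scores, retention[label])))
    return kept

if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, one 32k-token CoT.
    print(kv_cache_bytes(32, 8, 128, 32_768) / 2**30)  # 4.0 (GiB)
    segments = [
        ([0.3, 0.3, 0.2, 0.2], [0.1, 0.4, 0.2, 0.3]),
        ([0.9] + [0.005] * 8 + [0.05], [5, 1, 4, 2, 3, 0, 9, 8, 7, 6]),
        ([0.99] + [0.001] * 10, [3, 1, 2]),
    ]
    print(thought_adaptive_compress(segments))
    # [('dense', [0, 1, 2, 3]), ('moderate', [0, 6, 7, 8, 9]), ('sparse', [0])]
```

The budget is set per segment rather than globally, so that segments judged less important (here, arbitrarily, those with sparser attention) are evicted more aggressively. That mapping from sparsity to importance is an assumption made for illustration only. Because the cache size is linear in `seq_len`, doubling the CoT length in the first example doubles the 4 GiB figure.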