AI RESEARCH

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

arXiv cs.AI

arXiv:2512.07993v2 Announce Type: replace

Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead because the cache grows linearly with the length of their verbose chain-of-thought (CoT) reasoning. This growth creates both memory overhead and throughput bottlenecks, which limits efficient deployment. To reduce KV cache size during inference, we first investigate how effective existing KV cache eviction methods are for CoT reasoning.
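To make the linear-growth claim concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard decoder-only transformer that caches one key tensor and one value tensor per layer. The function name `kv_cache_bytes` and the model shape used below (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical illustrations chosen for easy arithmetic. They are not taken from the SkipKV paper, and the sketch does not reflect SkipKV's method.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size in bytes (assumed standard decoder-only layout).

    The factor 2 accounts for storing both keys and values at every layer.
    Every other factor is fixed by the model, except seq_len, which is why
    the cache grows linearly as a chain-of-thought gets longer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for n in (1_024, 8_192, 32_768):
        gib = kv_cache_bytes(seq_len=n, **cfg) / 2**30
        # Prints 0.125, 1.000 and 4.000 GiB respectively.
        print(f"{n} tokens -> {gib:.3f} GiB")
```

Under these assumptions, each generated token adds 128 KiB to the cache. A 32K-token reasoning trace therefore needs about 4 GiB per sequence before any eviction, and that figure scales with batch size as well.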