AI RESEARCH

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

arXiv CS.AI

arXiv:2604.24647v1 Announce Type: cross

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length and therefore becomes a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference.
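To make the two claims above concrete, here is a minimal Python sketch of linear KV cache growth and of generic attention-score pruning. It is not the DepthKV method: it illustrates only the baseline idea the abstract describes, with the same budget everywhere. The function names, the model dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per value), and the use of a single accumulated score per token are all assumptions chosen for illustration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both a key and a value
    per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def prune_kv_cache(keys, values, attn_scores, budget):
    """Keep the `budget` cached tokens with the highest attention scores.

    keys, values: lists with one entry per cached token.
    attn_scores:  one (assumed, accumulated) attention score per cached token.
    Surviving tokens keep their original positional order.
    """
    if budget >= len(keys):
        return list(keys), list(values), list(range(len(keys)))
    # Rank positions by score, highest first, then restore positional order.
    ranked = sorted(range(len(keys)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


if __name__ == "__main__":
    # Hypothetical model shape; doubling the context doubles the cache.
    for n in (1_000, 2_000, 4_000):
        print(n, "tokens:", kv_cache_bytes(n, 32, 8, 128), "bytes")

    # Toy cache with four tokens; keep the two highest-scoring ones.
    k, v, idx = prune_kv_cache(
        ["k0", "k1", "k2", "k3"],
        ["v0", "v1", "v2", "v3"],
        [0.5, 0.1, 0.3, 0.1],
        budget=2,
    )
    print("kept positions:", idx)
```

In this sketch the budget is a single number applied uniformly. A layer-dependent scheme, as the title suggests, would presumably pass a different `budget` for each layer, but how those budgets are chosen is not stated in the text above.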