LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

arXiv:2603.10899v1 Announce Type: cross Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV pairs that are deemed unimportant, guided by estimated importance scores.
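
To make the generic idea of score-guided eviction concrete, the following is a minimal sketch in plain Python. It is not the LookaheadKV method and does not show how importance scores are estimated; it only illustrates keeping a fixed budget of prompt KV pairs given scores that are assumed to be already available. The names `select_kept_positions`, `apply_eviction`, and `budget` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple


def select_kept_positions(scores: Sequence[float], budget: int) -> List[int]:
    """Return the prompt positions whose KV pairs survive eviction.

    `scores` holds one (assumed, precomputed) importance score per prompt token.
    `budget` is the maximum number of KV pairs to keep.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top `budget` positions, restored to original sequence order.
    return sorted(ranked[:budget])


def apply_eviction(cache: Sequence[Tuple[str, str]], keep: Sequence[int]) -> List[Tuple[str, str]]:
    """Drop every cached (key, value) entry whose position is not in `keep`."""
    return [cache[i] for i in keep]


# Toy cache: one (key, value) placeholder per prompt token.
cache = [(f"k{i}", f"v{i}") for i in range(5)]
scores = [0.9, 0.1, 0.4, 0.05, 0.7]

keep = select_kept_positions(scores, budget=3)
pruned = apply_eviction(cache, keep)
print(keep)    # [0, 2, 4]
print(pruned)  # [('k0', 'v0'), ('k2', 'v2'), ('k4', 'v4')]
```

In this sketch the cache shrinks from one entry per prompt token to at most `budget` entries, so its size no longer grows with the input length. In a real model the same selection would typically be applied per layer and per attention head to key and value tensors, which is an assumption beyond what the abstract states.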