DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

arXiv:2510.09883v2 Announce Type: replace-cross

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. One approach to reduce this latency is to evict entries from the key-value (KV) cache, thereby reducing the active context used in attention computation.
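
To make the idea of KV-cache eviction concrete, the following is a minimal sketch of a generic eviction heuristic, not the method proposed in this paper. It assumes a single attention head and a single decoding step, and it keeps only the cached entries that received the largest attention weight. The function names `attention` and `evict`, the `budget` parameter, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the max logit for a numerically stable softmax.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def evict(keys, values, weights, budget):
    """Keep the `budget` entries with the largest weight, in original order.

    Assumption: importance is judged by the attention weights of one query.
    This is a simple illustrative heuristic, not the paper's selection rule.
    """
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy KV cache with four cached tokens (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [2.0, 0.0]

full_out, weights = attention(query, keys, values)
small_keys, small_values, kept = evict(keys, values, weights, budget=2)
approx_out, _ = attention(query, small_keys, small_values)
```

After eviction, the next decoding step attends over two cached entries instead of four, which is where the latency saving comes from. The trade-off is that `approx_out` only approximates `full_out`, and an evicted entry cannot be recovered if a later query would have needed it.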