Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

arXiv:2602.03216v2 Announce Type: replace-cross The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance.