STS: Efficient Sparse Attention with Speculative Token Sparsity

arXiv:2605.15508v1 Announce Type: new The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million-token sequences. We propose STS, a sparse attention mechanism that requires no model re