AI RESEARCH

SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

arXiv cs.LG

arXiv:2602.06283v2 Announce Type: replace Exploiting sparsity during long-context inference is key to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends on efficient scoring and selection at inference time. We revisit Locality-Sensitive Hashing (LSH) and