AI RESEARCH
SOCKET: SOft Collision Kernel EsTimator for Sparse Attention
arXiv CS.LG
arXiv:2602.06283v2 Announce Type: replace
Exploiting sparsity during long-context inference is key to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends on efficient scoring and selection at inference time. We revisit Locality-Sensitive Hashing (LSH) and