DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

arXiv:2605.18753v1 Announce Type: cross

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and then apply fine-grained softmax attention to the tokens in the selected blocks. However, the top-k operation assumes that every query has the same fixed number of relevant tokens, and it blocks gradient flow between the sparse and dense stages.
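
To make the two-stage baseline concrete, here is a minimal single-query sketch of top-k block selection followed by fine-grained softmax attention. It is an illustration under stated assumptions, not the implementation used by NSA, InfLLMv2, or DashAttention: the function name `topk_block_attention`, the mean-pooled block summary, and the scaled dot-product scoring are all hypothetical choices made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def topk_block_attention(q, keys, values, block_size, k):
    """Two-stage attention for one query (illustrative sketch).

    Stage 1 scores each KV block coarsely; stage 2 runs softmax attention
    only over tokens in the k best blocks.
    """
    n, d = len(keys), len(q)
    blocks = [list(range(s, min(s + block_size, n)))
              for s in range(0, n, block_size)]

    # Coarse stage. Assumption: each block is summarized by its mean key.
    coarse = []
    for idx in blocks:
        pooled = [sum(keys[i][j] for i in idx) / len(idx) for j in range(d)]
        coarse.append(dot(q, pooled))

    # Hard top-k: always exactly k blocks, whatever the score distribution.
    # The selection is a discrete index choice, so no gradient reaches
    # `coarse` through it.
    ranked = sorted(range(len(blocks)), key=lambda b: coarse[b], reverse=True)
    chosen = sorted(ranked[:k])

    # Fine stage: scaled dot-product softmax over the selected tokens only.
    token_idx = [i for b in chosen for i in blocks[b]]
    scale = 1.0 / math.sqrt(d)
    weights = softmax([dot(q, keys[i]) * scale for i in token_idx])
    out = [sum(w * values[i][j] for w, i in zip(weights, token_idx))
           for j in range(len(values[0]))]
    return out, chosen


q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [2.0, 0.0], [2.0, 0.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
out, chosen = topk_block_attention(q, keys, values, block_size=2, k=1)
```

In this toy input the second block scores higher in the coarse stage, so with `k=1` only its two tokens reach the softmax. The sketch exposes both limitations named above: `k` is the same for every query, and the `sorted(...)[:k]` step is a hard selection that carries no gradient back to the coarse scores.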