RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

arXiv:2510.04008v5 Announce Type: replace Softmax Attention has quadratic time complexity in sequence length, which makes it prohibitively expensive at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a single attention layer once the context exceeds ~4M tokens on an NVIDIA GH200 (96 GB). We