$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

arXiv:2511.10696v2. Transformers have revolutionized natural language processing, but their quadratic complexity in sequence length remains a fundamental bottleneck for long-range modeling. Sparse attention mechanisms such as RingAttention reduce computational cost by restricting attention to local neighborhoods, but they suffer from limited receptive fields and a lack of adaptability.