Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

arXiv:2603.26380v1 Announce Type: new The attention mechanism is the core component of modern transformer architectures. However, the cost of standard full attention scales quadratically with sequence length, which makes it a major bottleneck in long-context language modeling. Sliding window attention restricts each token's context to a fixed-size window, improving efficiency at the cost of a narrower receptive field.
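
To make the trade-off between full attention and sliding window attention concrete, the following is a minimal sketch in plain Python. It is not the paper's method; the function names (`causal_mask`, `sliding_window_mask`, `attended_pairs`) and the toy sizes are illustrative assumptions. It builds the two boolean attention masks and counts how many query-key pairs each one evaluates.

```python
def causal_mask(n):
    """Full causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Sliding window attention: token i attends only to the most recent
    `window` tokens, itself included (positions i - window + 1 .. i)."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Count the query-key pairs a mask allows (a proxy for compute cost)."""
    return sum(sum(row) for row in mask)


# Toy sizes chosen for illustration only.
n, window = 8, 3
full_pairs = attended_pairs(causal_mask(n))                 # n * (n + 1) / 2
swa_pairs = attended_pairs(sliding_window_mask(n, window))  # at most n * window

print(full_pairs, swa_pairs)
```

The full causal mask allows n(n+1)/2 pairs, which grows quadratically in n, while the windowed mask allows at most n times `window` pairs, which grows linearly. The price is visible in the last row of the windowed mask: the final token can no longer see anything earlier than the previous `window` positions.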