MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

arXiv:2510.18830v2

Long context windows have become a standard feature of Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long contexts. However, efficiently