FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

arXiv:2604.24013v1

The rapid growth in the size of large language models has necessitated partitioning computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, which significantly hinders computational efficiency. While communication-computation overlap is a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research