TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

arXiv:2505.11329v5. Distributed inference of large language models (LLMs) using tensor parallelism can [...]. We present TokenWeave, the first system to enable efficient compute-communication overlap for tensor-parallel model inference at token lengths as small as 1024. TokenWeave identifies RMSNorm, a previously overlooked operation, as crucial, and optimizes it together with communication by implementing a novel fused AllReduce-RMSNorm kernel.
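
To make the idea of fusing AllReduce with RMSNorm concrete, the following is a minimal pure-Python sketch, not the TokenWeave kernel itself. It simulates tensor-parallel ranks as a list of partial outputs and relies on one fact: RMSNorm acts on each token independently, so each rank can normalize only the tokens it reduced before the results are gathered. The decomposition into a reduce-scatter and an all-gather step, and all function names (`rmsnorm`, `allreduce_sum`, `unfused`, `fused_sketch`), are assumptions made for illustration and are not stated in the abstract.

```python
import math


def rmsnorm(row, weight, eps=1e-6):
    """RMSNorm over one token's hidden vector (a list of floats)."""
    mean_sq = sum(x * x for x in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(row, weight)]


def allreduce_sum(partials):
    """Element-wise sum of per-rank partial outputs shaped [tokens][hidden]."""
    num_tokens = len(partials[0])
    hidden = len(partials[0][0])
    return [
        [sum(p[t][h] for p in partials) for h in range(hidden)]
        for t in range(num_tokens)
    ]


def unfused(partials, weight):
    """Baseline: full AllReduce first, then RMSNorm over every token."""
    reduced = allreduce_sum(partials)
    return [rmsnorm(row, weight) for row in reduced]


def fused_sketch(partials, weight):
    """Hypothetical fused form: each simulated rank reduces and normalizes
    only its own slice of tokens, then the slices are concatenated."""
    world = len(partials)
    num_tokens = len(partials[0])
    assert num_tokens % world == 0, "sketch assumes tokens divide evenly"
    shard = num_tokens // world
    out = []
    for rank in range(world):
        lo, hi = rank * shard, (rank + 1) * shard
        # Reduce-scatter step: sum only this rank's token slice.
        local = allreduce_sum([p[lo:hi] for p in partials])
        # RMSNorm is per-token, so it can run on the slice right away.
        local_norm = [rmsnorm(row, weight) for row in local]
        # All-gather step, simulated here by concatenation in rank order.
        out.extend(local_norm)
    return out


# Two simulated ranks, four tokens, hidden size three (made-up values).
partials = [
    [[0.5 * (r + 1) + t + 0.1 * h for h in range(3)] for t in range(4)]
    for r in range(2)
]
weight = [1.0, 0.5, 2.0]

baseline = unfused(partials, weight)
fused = fused_sketch(partials, weight)
```

In this simulation `baseline` and `fused` hold the same values, which is the property that lets the normalization move inside the collective. On real hardware the benefit would come from avoiding a separate pass over the reduced tensor; the sketch shows only the numerical equivalence, not the performance.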