AI RESEARCH
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
arXiv cs.LG
arXiv:2605.06055v1 Announce Type: cross
Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems.
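For readers unfamiliar with the terms, the following is a minimal single-process sketch of the dispatch and combine steps the abstract refers to. It is not the report's design and performs no real communication: the function names `dispatch` and `combine`, the top-1 routing, and the string tokens are all assumptions made for illustration. The per-expert lists stand in for the kind of intermediate reordering buffers that a buffer-centric path materializes around the collective transfer.

```python
from typing import List, Tuple


def dispatch(
    tokens: List[str], expert_ids: List[int], num_experts: int
) -> Tuple[List[List[str]], List[List[int]]]:
    """Group tokens by their routed expert (top-1 routing assumed).

    Returns one bucket of tokens per expert, plus the original position
    of every token so the output order can be restored later.
    """
    buckets: List[List[str]] = [[] for _ in range(num_experts)]
    origins: List[List[int]] = [[] for _ in range(num_experts)]
    for pos, (tok, expert) in enumerate(zip(tokens, expert_ids)):
        buckets[expert].append(tok)   # routing-driven layout transformation
        origins[expert].append(pos)   # bookkeeping needed for restoration
    return buckets, origins


def combine(
    expert_outputs: List[List[str]], origins: List[List[int]], num_tokens: int
) -> List[str]:
    """Scatter expert outputs back to the original token order."""
    restored: List[str] = [""] * num_tokens
    for outputs, positions in zip(expert_outputs, origins):
        for out, pos in zip(outputs, positions):
            restored[pos] = out       # output restoration
    return restored


# Hypothetical example: four tokens routed across two experts.
tokens = ["t0", "t1", "t2", "t3"]
expert_ids = [1, 0, 1, 0]

buckets, origins = dispatch(tokens, expert_ids, num_experts=2)

# Stand-in for expert computation: each expert upper-cases its tokens.
expert_outputs = [[tok.upper() for tok in bucket] for bucket in buckets]

restored = combine(expert_outputs, origins, len(tokens))
print(buckets)   # tokens grouped per expert
print(restored)  # outputs back in original token order
```

In a multi-device deployment each bucket would be sent to the device hosting that expert and the outputs sent back, so the grouping, the temporary copies, and the final scatter each add cost on top of the network transfer itself.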