GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
arXiv CS.AI
arXiv:2605.19945v1 (cross-listed). Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and, at inference time, route each token to the GPUs that host its activated experts. They process tokens in lock-step: every token in a batch must finish a layer before the batch proceeds to the next layer. This synchronization barrier is a critical bottleneck because each layer takes as long as the straggler GPU that finishes last.
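To make the straggler effect concrete, here is a minimal sketch of the lock-step barrier described above. It is not the GEM algorithm, which the abstract does not detail; the function name `layer_latency`, the token counts, and the per-token times are all hypothetical values chosen for illustration.

```python
def layer_latency(tokens_per_gpu, time_per_token):
    """Return (layer time, per-GPU times) for one lock-step MoE layer.

    Every GPU must finish before the next layer starts, so the layer
    time is the maximum of the per-GPU times, not their average.
    """
    per_gpu = [n * t for n, t in zip(tokens_per_gpu, time_per_token)]
    return max(per_gpu), per_gpu


# Hypothetical setup: four GPUs receive the same number of tokens,
# but GPU 3 is 25% slower per token (hardware variability).
tokens = [100, 100, 100, 100]
time_per_token = [1.0, 1.0, 1.0, 1.25]  # arbitrary time units

latency, per_gpu = layer_latency(tokens, time_per_token)
straggler = per_gpu.index(latency)              # GPU that finishes last
idle = sum(latency - t for t in per_gpu)        # time other GPUs spend waiting

print(f"layer time: {latency}, straggler: GPU {straggler}, total idle: {idle}")
```

Even with a perfectly balanced token load, the slower GPU sets the pace for the whole layer and the three faster GPUs sit idle at the barrier, which is why the mapping of experts to GPUs matters when GPUs differ in speed.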