AI RESEARCH

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

arXiv cs.LG

arXiv:2411.08982v3 Announce Type: replace

The selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when deployed for serving. Batching, which is critical for serving performance, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension, even with extensive workload-specific tuning.
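The batching tension can be illustrated with a small sketch. It is not taken from the paper: it assumes, purely for illustration, that each token routes independently and uniformly at random to its top-k experts (real routers are skewed), and the function names and the 8-expert, top-2 configuration are hypothetical. Under that assumption, the expected fraction of experts touched by a batch of B tokens is 1 - (1 - k/E)^B, which approaches 1 quickly as B grows.

```python
import random


def expected_fraction_active(batch_size, num_experts=8, top_k=2):
    """Expected fraction of experts activated by a batch.

    Assumption (illustrative only): every token picks top_k distinct
    experts uniformly at random, independently of other tokens.
    A given expert is missed by one token with probability 1 - k/E,
    so it is missed by the whole batch with probability (1 - k/E)^B.
    """
    miss_one_token = 1.0 - top_k / num_experts
    return 1.0 - miss_one_token ** batch_size


def distinct_experts_activated(batch_size, num_experts=8, top_k=2, seed=0):
    """Simulate one batch and count how many distinct experts it touches."""
    rng = random.Random(seed)
    active = set()
    for _ in range(batch_size):
        # Each token selects top_k distinct experts (uniform routing assumed).
        active.update(rng.sample(range(num_experts), top_k))
    return len(active)


if __name__ == "__main__":
    for batch_size in (1, 4, 16, 64):
        frac = expected_fraction_active(batch_size)
        seen = distinct_experts_activated(batch_size)
        print(f"batch={batch_size:3d}  expected fraction={frac:.3f}  simulated experts={seen}/8")
```

With a single token only k of the E experts are read from memory, but under these assumptions a moderately sized batch touches nearly every expert, so the weights of all experts must be loaded for each layer.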