AI RESEARCH
MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
arXiv CS.LG
arXiv:2604.06798v1 Announce Type: new Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1.