Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

arXiv:2412.00069v3

Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while using the same number of active parameters, or even fewer. However, MoE does not alleviate the massive memory requirements of these networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores removing entire MoE layers to reduce memory, the resulting performance degradation remains notable.