Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

arXiv:2502.04416v3

Scaling large language models (LLMs) improves performance but significantly increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While Mixture-of-Experts (MoE) architectures can reduce this cost through sparse activation, restructuring existing dense models into MoEs typically requires extensive re