Post-Trained MoE Can Skip Half Experts via Self-Distillation

arXiv:2605.18643v1 Announce Type: new Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-