EMO: Frustratingly Easy Progressive Training of Extendable MoE

arXiv:2605.13247v1 Announce Type: new Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, because per-token FLOPs depend only on the k active experts rather than on the total pool of E experts. Yet this asymmetry creates an MoE efficiency paradox in practice: adding experts balloons memory and communication costs, making actual