How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

arXiv:2605.14200v1

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale.