Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

arXiv:2604.19835v1 Announce Type: new Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However