Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resources

arXiv:2506.12119v2

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count