CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

arXiv:2603.28768v1 Announce Type: cross

Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this