Expert Upcycling: Growing MoE capacity mid-training without increasing inference cost (7B→13B, ~32% GPU hours saved)

r/LocalLLaMA

Author here, sharing a preprint we recently released. We're actively looking for feedback from this community before we revise.

Motivation