EMO: Mixture-of-Experts That Actually Behaves Like One

Most MoE models are just big transformers with a traffic cop attached. The router directs tokens to different experts, sure, but try to run the model with only its code experts and the whole thing falls apart. That's not modularity. That's sharding with extra steps.

The problem isn't that MoE doesn't work. It's that the experts don't specialize where it matters. Open up a standard MoE and you'll find one expert handling prepositions, another managing punctuation, a third dealing with numbers. The specialization is lexical, not semantic: experts split by token type rather than by domain, so no subset of them adds up to a "code model".
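To make "ask for just the code experts" concrete, here is a minimal sketch of generic top-k routing with an optional expert subset. It is an illustration only, not EMO's router or any library's API. The names `route`, `allowed`, and `coverage` are hypothetical, and the logits are random stand-ins for a real router's output.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k, allowed=None):
    """Return [(expert_index, weight), ...] for the top-k experts of one token.

    If `allowed` (a set of expert indices) is given, every other expert is
    masked out before the top-k selection. This is an assumed, generic scheme.
    """
    if allowed is not None:
        if len(allowed) < k:
            raise ValueError("need at least k allowed experts")
        logits = [x if i in allowed else float("-inf")
                  for i, x in enumerate(logits)]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]


def coverage(all_logits, k, allowed):
    """Fraction of tokens whose unrestricted top-k experts all lie in `allowed`."""
    hits = 0
    for logits in all_logits:
        chosen = {i for i, _ in route(logits, k)}
        if chosen <= allowed:
            hits += 1
    return hits / len(all_logits)


# Toy data: 1000 tokens, 8 experts, random router logits (an assumption).
random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
subset = {0, 1, 2}  # pretend these are "the code experts"
print("tokens fully served by the subset:", coverage(tokens, 2, subset))
```

When routing doesn't line up with domains, `coverage` stays low: most tokens want at least one expert outside the subset, so masking pushes them onto experts the router would not otherwise have picked for them. That is the "falls apart" case in miniature.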