Has increasing the number of experts used in MoE models ever meaningfully helped?

I remember a lot of debate about whether this was worthwhile back when Qwen3-30B-A3B came out. A few people even swore by "Qwen3-30b-A6B" for a short while. It's still an easy setting to change in llama.cpp, but I don't really see anyone experimenting with it anymore. Has anyone been testing this much?

submitted by /u/ForsookComparison