Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model

arXiv:2507.01351v2 Announce Type: replace

The mixture-of-experts (MoE) architecture, which replaces dense networks with sparse ones, has attracted significant attention in large vision-language models (LVLMs) because it achieves comparable performance while activating far fewer parameters. Existing MoE architectures for LVLMs primarily focus on token-to-expert routing (TER), which encourages different experts to specialize in processing specific tokens.
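
To make token-to-expert routing concrete, the following is a minimal sketch of generic top-k softmax routing, in which each token selects its highest-scoring experts. It is not the router proposed in this paper; the function names (`softmax`, `route_tokens`), the `top_k` parameter, and the logit values are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_tokens(router_logits, top_k=2):
    """Generic token-to-expert routing (illustrative, not the paper's method).

    router_logits: one list of per-expert scores for each token.
    Returns, per token, a list of (expert_index, weight) pairs for the
    top_k experts, with weights renormalized to sum to 1.
    """
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        # Pick the top_k experts with the highest routing probability.
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        assignments.append([(i, probs[i] / norm) for i in top])
    return assignments

# Hypothetical router logits: 3 tokens, 4 experts.
logits = [
    [2.0, 0.5, -1.0, 0.1],
    [-0.3, 1.7, 1.6, 0.0],
    [0.0, 0.4, 3.0, -2.0],
]
routes = route_tokens(logits, top_k=2)
for token_id, route in enumerate(routes):
    print(token_id, [(e, round(w, 3)) for e, w in route])
```

Only the selected experts run for each token, which is how a sparse MoE layer activates far fewer parameters than a dense layer of the same total size.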