Abstract: Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
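To make the two-level idea concrete, here is a minimal sketch of how balance could be measured at the group level and again inside each group. It is not the paper's objective: the abstract does not give the loss formulas, so this sketch assumes a generic Switch-Transformer-style auxiliary term (number of options times the sum over options of dispatch fraction times mean router probability), top-1 routing, and experts laid out contiguously by group. The function names `balance_loss` and `hierarchical_losses` are hypothetical, and the intra-group term only illustrates the collapse-prevention side, not the specialization term.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def balance_loss(probs, choices, n):
    """Generic Switch-style auxiliary loss (an assumption, not Hi-MoE's loss).

    probs:   per-token probability lists over n options
    choices: per-token index of the option actually selected
    Returns n * sum_i f_i * p_i, which is 1.0 under perfectly even
    routing and grows toward n as traffic collapses onto one option.
    """
    tokens = len(probs)
    frac = [0.0] * n
    for c in choices:
        frac[c] += 1.0 / tokens
    mean_p = [sum(p[i] for p in probs) / tokens for i in range(n)]
    return n * sum(f * q for f, q in zip(frac, mean_p))


def hierarchical_losses(logits, group_size):
    """Return (inter_group_loss, [intra_group_loss per group]).

    logits: per-token router logits over all experts, with experts
            stored group by group (hypothetical layout).
    """
    n_groups = len(logits[0]) // group_size
    group_probs, group_choice = [], []
    intra_probs = [[] for _ in range(n_groups)]
    intra_choice = [[] for _ in range(n_groups)]
    for tok in logits:
        p = softmax(tok)
        # Level 1: a group's probability is the mass of its experts.
        gp = [sum(p[g * group_size:(g + 1) * group_size])
              for g in range(n_groups)]
        g = max(range(n_groups), key=gp.__getitem__)
        group_probs.append(gp)
        group_choice.append(g)
        # Level 2: renormalize inside the chosen group, pick top-1 expert.
        local = [x / gp[g] for x in p[g * group_size:(g + 1) * group_size]]
        e = max(range(group_size), key=local.__getitem__)
        intra_probs[g].append(local)
        intra_choice[g].append(e)
    inter = balance_loss(group_probs, group_choice, n_groups)
    # Groups that received no tokens contribute 0.0 in this sketch.
    intra = [balance_loss(intra_probs[g], intra_choice[g], group_size)
             if intra_probs[g] else 0.0
             for g in range(n_groups)]
    return inter, intra


# Four experts in two groups of two; each token prefers a different expert.
balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
# Every token prefers expert 0: collapse at both levels.
collapsed = [[5, 0, 0, 0]] * 4

print(hierarchical_losses(balanced, group_size=2))
print(hierarchical_losses(collapsed, group_size=2))
```

Separating the two terms lets each level be weighted independently, which is one plausible reading of "decomposes routing control into two coupled levels"; how Hi-MoE actually couples them is described in the paper itself, not here.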
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC) |
| Cite as: | arXiv:2605.08292 [cs.LG] (or arXiv:2605.08292v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.08292 (arXiv-issued DOI via DataCite) |
Submission history
From: Gleb Molodtsov Mr
[v1] Fri, 8 May 2026 09:21:46 UTC (974 KB)
