Hierarchical Mixture-of-Experts with Two-Stage Optimization

View PDF HTML (experimental)

Abstract:Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Cite as:	arXiv:2605.08292 [cs.LG]
	(or arXiv:2605.08292v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2605.08292 arXiv-issued DOI via DataCite

Submission history

From: Gleb Molodtsov Mr [view email]
[v1] Fri, 8 May 2026 09:21:46 UTC (974 KB)