Abstract: Muon has recently emerged as a strong optimizer for large language model pre-training. It orthogonalizes the momentum matrix via Newton--Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this intuition does not hold in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in both training and validation perplexity, yielding a significant overall pre-training speedup.
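The mechanism described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The quintic Newton--Schulz coefficients are an assumption borrowed from commonly used open-source Muon code. The abstract does not say which normalization Muon+ applies, so the column-wise rescaling in `normalize_columns` is a hypothetical stand-in. All function names are illustrative.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_norm(A):
    return math.sqrt(sum(x * x for row in A for x in row))


def newton_schulz(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximate the polar factor of G with a quintic Newton-Schulz iteration.

    Update rule: X <- a*X + (b*A + c*A@A) @ X, with A = X @ X^T.
    The coefficients are an assumption, not taken from the abstract.
    """
    a, b, c = coeffs
    scale = frobenius_norm(G) + 1e-7  # keeps all singular values at or below 1
    X = [[x / scale for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        n = len(A)
        B = [[b * A[i][j] + c * A2[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, X)
        X = [[a * x + bx for x, bx in zip(x_row, bx_row)]
             for x_row, bx_row in zip(X, BX)]
    return X


def column_norms(X):
    return [math.sqrt(sum(x * x for x in col)) for col in zip(*X)]


def imbalance(X):
    """Ratio of the largest to the smallest column norm (1.0 means balanced)."""
    norms = column_norms(X)
    return max(norms) / min(norms)


def normalize_columns(X, eps=1e-8):
    """Hypothetical post-polar step: rescale every column to unit norm."""
    norms = column_norms(X)
    return [[x / (n + eps) for x, n in zip(row, norms)] for row in X]


# Usage: compare column-norm imbalance before and after the extra step.
random.seed(0)
momentum = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(6)]
update = newton_schulz(momentum)
print("imbalance after Newton-Schulz:", imbalance(update))
print("imbalance after normalization:", imbalance(normalize_columns(update)))
```

With only a few Newton--Schulz steps the singular values are not exactly one, so the column norms of `update` generally differ. The rescaling removes that difference by construction, at the cost of changing the singular values again.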
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2602.21545 [cs.LG] (or arXiv:2602.21545v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.21545 (arXiv-issued DOI via DataCite) |
Submission history
From: Ruijie Zhang
[v1] Wed, 25 Feb 2026 04:04:00 UTC (3,469 KB)
[v2] Thu, 26 Feb 2026 17:01:08 UTC (3,469 KB)
[v3] Thu, 14 May 2026 00:38:30 UTC (9,617 KB)
