Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

Abstract: Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale language-model pretraining and fine-tuning, recent work has revisited nearly every component of the optimization stack: adaptive moment estimation, decoupled weight decay, memory footprint, curvature approximation, sign-based updates, large-batch stability, low-rank gradient structure, and matrix-wise orthogonalized updates. This survey reviews optimizer design for large language models through a systems-and-optimization lens. We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers such as Muon. We also discuss benchmarking methodology, including hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream evaluation. We argue that optimizer research for LLMs is entering a new phase: moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons that jointly evaluate convergence, stability, memory, and implementation complexity.

Comments:	No figures, 65 pages
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.09176 [cs.LG]
	(or arXiv:2605.09176v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2605.09176 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Aditya Ranganath
[v1] Sat, 9 May 2026 21:34:28 UTC (63 KB)