Multi-Armed Sampling Problem and the End of Exploration

View PDF HTML (experimental)

Abstract:This paper introduces the framework of multi-armed sampling, which serves as the sampling counterpart to the optimization problem of multi-armed bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves near-optimal regret bounds. Our theoretical results suggest that, in contrast to optimization, sampling barely requires any exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolate and unify multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe that the multi-armed sampling framework and our findings in this setting can play a foundational role in the study of sampling, including recent neural samplers, much like the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the role of exploration (or lack thereof) and the convergence properties of algorithms for entropy-regularized reinforcement learning, fine-tuning of pretrained models and reinforcement learning with human feedback (RLHF).

Comments:	29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
Subjects:	Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as:	arXiv:2507.10797 [cs.LG]
	(or arXiv:2507.10797v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2507.10797 arXiv-issued DOI via DataCite

Submission history

From: Mohammad Pedramfar [view email]
[v1] Mon, 14 Jul 2025 20:50:51 UTC (513 KB)
[v2] Tue, 12 May 2026 19:14:11 UTC (571 KB)