Abstract:This paper introduces the framework of multi-armed sampling, which serves as the sampling counterpart to the optimization problem of multi-armed bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves near-optimal regret bounds. Our theoretical results suggest that, in contrast to optimization, sampling barely requires any exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolate and unify multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe that the multi-armed sampling framework and our findings in this setting can play a foundational role in the study of sampling, including recent neural samplers, much like the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the role of exploration (or lack thereof) and the convergence properties of algorithms for entropy-regularized reinforcement learning, fine-tuning of pretrained models and reinforcement learning with human feedback (RLHF).
| Comments: | 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026 |
| Subjects: | Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML) |
| Cite as: | arXiv:2507.10797 [cs.LG] |
| (or arXiv:2507.10797v2 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2507.10797 arXiv-issued DOI via DataCite |
Submission history
From: Mohammad Pedramfar [view email]
[v1]
Mon, 14 Jul 2025 20:50:51 UTC (513 KB)
[v2]
Tue, 12 May 2026 19:14:11 UTC (571 KB)
