Abstract: \emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms, in the form of either \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $\epsilon^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $\epsilon^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(\epsilon^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline contextual bandits (CBs). We give the first $\tilde{O}(\epsilon^{-1})$ upper bounds in the tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline, which might be of independent interest, unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the mean-value-theorem-based proof routines of previous works. Moreover, we provide rate-optimal lower bounds, establishing the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, as is the case for reverse-KL regularization.
| Comments: | 31 pages, comments are welcome |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML) |
| Cite as: | arXiv:2605.09214 [cs.LG] (or arXiv:2605.09214v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.09214 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Qingyue Zhao
[v1] Sat, 9 May 2026 23:17:46 UTC (52 KB)
