Abstract:In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity guarantee for finding an $\epsilon$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded
| Subjects: | Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML) |
| Cite as: | arXiv:2605.13639 [cs.LG] |
| (or arXiv:2605.13639v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2605.13639 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Ishaq Hamza [view email]
[v1]
Wed, 13 May 2026 15:04:59 UTC (45 KB)
