Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

Abstract: We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled, curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational cost of Hessian estimation. In this work, we analyze second-order approximations for the actor update that retain as much of the objective's curvature information as possible. A stable approximation requires treating the action-value function as locally constant with respect to the policy parameters, an assumption that does not generally hold in policy gradient methods. We show that this approximation becomes well justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that uses Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
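
To make the Hessian-vector product idea concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a toy single-state problem with a softmax policy, holds a hypothetical vector of action values `q` fixed (mirroring the quasi-stationary critic described above), and approximates the HVP by a central finite difference of the gradient rather than by automatic differentiation. All function names and numbers are illustrative assumptions.

```python
import math

def softmax(theta):
    # Numerically stable softmax over a list of logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient(theta, q):
    # Gradient of J(theta) = sum_a pi_theta(a) * q[a], with q held fixed.
    # For a softmax policy this is pi_a * (q_a - J) in each coordinate.
    pi = softmax(theta)
    j = sum(p * qa for p, qa in zip(pi, q))
    return [p * (qa - j) for p, qa in zip(pi, q)]

def hvp(theta, q, v, eps=1e-5):
    # Approximate H v by a central difference of the gradient along v.
    # The Hessian itself is never formed; only two gradient calls are needed.
    plus = policy_gradient([t + eps * vi for t, vi in zip(theta, v)], q)
    minus = policy_gradient([t - eps * vi for t, vi in zip(theta, v)], q)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(plus, minus)]

theta = [0.1, -0.2, 0.3]   # illustrative policy parameters
q = [1.0, 2.0, 0.5]        # hypothetical frozen critic estimates
v = [1.0, 0.0, -1.0]       # direction along which curvature is probed
hv = hvp(theta, q, v)
```

Because only products of the form `H v` are required, an iterative solver such as conjugate gradient could in principle use `hvp` to approximate a curvature-aware step without storing the full Hessian; whether and how the paper does this is not stated in the abstract.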

Comments:	9 pages, 2 figures, including an appendix with detailed proofs
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.14982 [cs.LG]
	(or arXiv:2605.14982v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2605.14982 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sanjeev Manivannan
[v1] Thu, 14 May 2026 15:46:27 UTC (1,481 KB)