Abstract: Offline reinforcement learning (RL), which optimizes policies from a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions. However, BC regularization can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially late in training, when imitation is already dominant. We formally analyze this limitation by investigating the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use, plug-and-play training-sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the local ascent direction of the action-value function and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art results when simply combined with the basic TD3+BC.
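The performance ceiling described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's bandit task or its PAR procedure; it assumes a hypothetical 1-D bandit with a quadratic action value Q(a) = -(a - a_star)^2, a single dataset action a_data, and a TD3+BC-style actor objective J(a) = lam * Q(a) - (a - a_data)^2. All names (`a_star`, `a_data`, `lam`, and both functions) are illustrative assumptions.

```python
# Toy 1-D continuous bandit (hypothetical; not the paper's experimental setup).
# Action value: Q(a) = -(a - a_star)^2, maximized at a = a_star.
# BC-regularized actor objective: J(a) = lam * Q(a) - (a - a_data)^2.


def bc_regularized_optimum(a_star: float, a_data: float, lam: float) -> float:
    """Closed-form maximizer of J(a), obtained by setting dJ/da = 0."""
    return (lam * a_star + a_data) / (lam + 1.0)


def gradient_ascent(a_star: float, a_data: float, lam: float,
                    lr: float = 0.05, steps: int = 2000) -> float:
    """Maximize J(a) numerically, starting from the dataset action."""
    a = a_data
    for _ in range(steps):
        # dJ/da = -2 * lam * (a - a_star) - 2 * (a - a_data)
        grad = -2.0 * lam * (a - a_star) - 2.0 * (a - a_data)
        a += lr * grad
    return a


if __name__ == "__main__":
    a_star, a_data, lam = 1.0, 0.0, 1.0

    # With a suboptimal imitation target, the actor settles between
    # a_data and a_star and never reaches the best action.
    print(bc_regularized_optimum(a_star, a_data, lam))  # 0.5
    print(gradient_ascent(a_star, a_data, lam))

    # Swapping in a better imitation target (chosen by hand here, purely
    # for illustration) moves the fixed point closer to a_star.
    print(bc_regularized_optimum(a_star, 0.5, lam))  # 0.75
```

Under these assumptions, the converged action is a weighted average of the dataset action and the value-optimal action, so for any finite `lam` it stays short of `a_star` unless the imitation target itself improves. That is the intuition behind replacing suboptimal dataset actions rather than only reweighting the BC term.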
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2602.07441 [cs.LG] (or arXiv:2602.07441v2 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.07441 (arXiv-issued DOI via DataCite) |
Submission history
From: Jinzong Dong
[v1] Sat, 7 Feb 2026 08:44:27 UTC (522 KB)
[v2] Thu, 14 May 2026 13:11:00 UTC (2,495 KB)
