Abstract: Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from \emph{boundary mismatch} at the flow source or from \emph{high-variance} bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using \textbf{source-consistent Bellman-coupled paths}: the current path starts from the required base prior at $t{=}0$, reaches the Bellman target at $t{=}1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda{=}0$ recovers an unbiased sample Bellman target, while $\lambda{>}0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.
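The variance argument behind sharing base noise can be illustrated with a toy example. The sketch below is not the PCBF algorithm: it assumes, purely for illustration, that the current and successor return samples are affine maps of a standard Gaussian base noise, and every name and constant in it (`residual_variance`, `mu_cur`, `sigma_next`, and so on) is hypothetical. It compares the variance of the sample Bellman residual $r + \gamma z' - z$ when the successor reuses the current noise against the case where it draws independent noise.

```python
import random
import statistics


def residual_variance(shared_noise, n=20000, seed=0):
    """Empirical variance of the sample Bellman residual in a toy Gaussian model.

    Assumption (illustrative only): return samples are affine in the base noise,
    z = mu + sigma * eps, for both the current and the successor state.
    """
    rng = random.Random(seed)
    r, gamma = 1.0, 0.9                 # hypothetical reward and discount
    mu_cur, sigma_cur = 5.0, 1.0        # hypothetical current-state return map
    mu_next, sigma_next = 4.0, 1.0      # hypothetical successor-state return map

    residuals = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        # Coupled case: the successor reuses the same base noise.
        # Independent case: the successor draws fresh noise.
        eps_next = eps if shared_noise else rng.gauss(0.0, 1.0)
        z_cur = mu_cur + sigma_cur * eps
        z_next = mu_next + sigma_next * eps_next
        target = r + gamma * z_next     # sample Bellman target
        residuals.append(target - z_cur)
    return statistics.pvariance(residuals)


var_shared = residual_variance(shared_noise=True)
var_indep = residual_variance(shared_noise=False)
print(f"shared noise:      {var_shared:.4f}")
print(f"independent noise: {var_indep:.4f}")
```

In this toy model the residual variance is $(\gamma\sigma' - \sigma)^2$ under shared noise and $\gamma^2\sigma'^2 + \sigma^2$ under independent noise, so the coupled estimate is much less noisy whenever the two maps respond similarly to the base noise. Whether this carries over to learned flows depends on details the abstract does not give.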
| Comments: | Accepted to the 43rd International Conference on Machine Learning (ICML 2026) |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI) |
| ACM classes: | I.2.6 |
| Cite as: | arXiv:2605.08253 [cs.LG] |
| | (or arXiv:2605.08253v1 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2605.08253 (arXiv-issued DOI via DataCite) |
| Journal reference: | Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026 |
Submission history
From: Hao Yan
[v1] Thu, 7 May 2026 19:05:01 UTC (8,030 KB)
