Authors: Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen
Abstract: Reinforcement learning (RL) is pivotal for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) probability estimates that do not account for the gap to the unbiased expectation over all decoding orders, which is intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. Furthermore, we provide a theoretical proof demonstrating that increasing prediction confidence effectively minimizes the gap between the unbiased expected prediction probability and its single-step forward-pass estimate. Guided by this analysis, we introduce a time-scheduled self-distillation loss that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and better performance. Experiments demonstrate that d-TreeRPO outperforms existing baselines and achieves significant improvements across multiple reasoning benchmarks. Specifically, it achieves +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 compared to the base model.
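As a rough illustration of what bottom-up advantage computation over a tree of rollouts could look like, the sketch below backs up verifiable leaf rewards into node values and then scores each step relative to its parent. The `Node` class, the mean-of-children backup, and the child-minus-parent advantage rule are illustrative assumptions made here; the abstract does not specify the exact formulas d-TreeRPO uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One step in a rollout tree (hypothetical structure, not from the paper)."""
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # set only on leaves, by a verifiable outcome check
    value: float = 0.0
    advantage: float = 0.0


def backup(node: Node) -> float:
    """Bottom-up pass: a leaf's value is its outcome reward (0.0 if unset);
    an internal node's value is the mean of its children's values (assumed rule)."""
    if not node.children:
        node.value = float(node.reward or 0.0)
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """Assumed rule: a child's advantage is its value minus its parent's value."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)


# Toy tree: branch `a` reaches one correct and one incorrect answer,
# branch `b` reaches two incorrect answers.
a = Node(children=[Node(reward=1.0), Node(reward=0.0)])
b = Node(children=[Node(reward=0.0), Node(reward=0.0)])
root = Node(children=[a, b])

backup(root)
assign_advantages(root)
print(root.value, a.advantage, b.advantage)
```

In this toy tree, the intermediate step `a` receives a positive advantage and `b` a negative one even though rewards exist only at the leaves, which is the sense in which a tree can turn a sparse outcome reward into step-wise signals.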
| Comments: | ACL 2026 Main |
| Subjects: | Computation and Language (cs.CL) |
| MSC classes: | 68T50 |
| ACM classes: | I.2.7 |
| Cite as: | arXiv:2512.09675 [cs.CL] (or arXiv:2512.09675v3 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2512.09675 (arXiv-issued DOI via DataCite) |
Submission history
From: Leyi Pan
[v1] Wed, 10 Dec 2025 14:20:07 UTC (878 KB)
[v2] Tue, 6 Jan 2026 08:58:43 UTC (2,314 KB)
[v3] Wed, 13 May 2026 05:10:13 UTC (2,564 KB)
