Authors: Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie
Abstract: GRPO has emerged as a prominent reinforcement learning algorithm for post-training LLMs. Unlike critic-based methods, GRPO computes advantages by estimating value baselines from group-level statistics, eliminating the need for a critic network. Consequently, the prevailing view emphasizes the necessity of large group sizes, which are assumed to yield more accurate statistical estimates. In this paper, we propose a different view: the efficacy of GRPO stems from the implicit contrastive objective in its optimization, which helps reduce variance via the control variate method. This makes GRPO structurally related to preference learning methods such as DPO. This perspective motivates 2-GRPO, a minimal group-size variant that constructs contrastive signals with only two rollouts. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains $97.6\%$ of the performance of 16-GRPO while requiring only $12.5\%$ of the rollouts and $21\%$ of the training time.
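The group-level advantage estimate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration. The sketch normalizes each rollout's reward by the mean and standard deviation of its group, and shows that with only two rollouts the signal reduces to a contrast between the better and the worse response.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation (divide by n); an
    implementation may use a different normalization.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts with different rewards: advantages are approximately +1 and -1,
# i.e. a pairwise contrast between the two responses.
print(group_advantages([1.0, 0.0]))

# Two rollouts with equal rewards: both advantages are zero,
# so this pair contributes no gradient signal.
print(group_advantages([1.0, 1.0]))

# Rollout budget of a group of 2 relative to a group of 16.
print(2 / 16)  # 0.125, matching the 12.5% of rollouts quoted in the abstract
```

Under these assumptions, a pair of rollouts yields a learning signal only when the two rewards differ, which is what makes the two-rollout case resemble a pairwise preference comparison.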
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2510.00977 [cs.LG] (or arXiv:2510.00977v3 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2510.00977 (arXiv-issued DOI via DataCite)
Submission history
From: Yihong Wu
[v1] Wed, 1 Oct 2025 14:52:11 UTC (92 KB)
[v2] Fri, 30 Jan 2026 01:26:19 UTC (122 KB)
[v3] Thu, 14 May 2026 04:25:16 UTC (435 KB)
