Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
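The token-weighting idea in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the abstract does not state how GRPO-SG computes its weights, so the proxy `1 - p` for a token's gradient magnitude, the threshold `tau`, and every function name below are hypothetical choices made only for illustration. The group-relative advantage follows the usual GRPO pattern of standardizing rewards within a group of sampled responses.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def token_weights(token_probs, tau=0.5):
    """Hypothetical weighting rule (an assumption, not the paper's).

    For a softmax policy, the gradient of log p(token) with respect to the
    logits has norm at least 1 - p(token), so low-probability tokens tend to
    produce larger gradients. Tokens whose proxy exceeds tau are scaled down
    so that their weighted proxy equals tau; other tokens keep weight 1.
    """
    weights = []
    for p in token_probs:
        proxy = 1.0 - p
        weights.append(1.0 if proxy <= tau else tau / proxy)
    return weights

def weighted_surrogate(token_logprobs, advantage, weights):
    """Token-weighted policy-gradient surrogate loss for one response."""
    total = sum(w * advantage * lp for w, lp in zip(weights, token_logprobs))
    return -total / len(token_logprobs)

# Toy group of four responses with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# One response with a confident token (p = 0.9) and an unlikely one (p = 0.1).
probs = [0.9, 0.1]
logprobs = [math.log(p) for p in probs]
weights = token_weights(probs, tau=0.5)

plain_loss = weighted_surrogate(logprobs, advantages[0], [1.0, 1.0])
damped_loss = weighted_surrogate(logprobs, advantages[0], weights)
```

In this toy case the unlikely token receives a weight below 1, so its contribution to the surrogate shrinks while the confident token is left unchanged. A clipping rule of this kind is only one of many ways to downweight high-gradient tokens.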
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2511.00066 [cs.LG] (or arXiv:2511.00066v4 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2511.00066 (arXiv-issued DOI via DataCite) |
Submission history
From: Tue Le
[v1] Wed, 29 Oct 2025 08:07:47 UTC (998 KB)
[v2] Mon, 22 Dec 2025 08:49:46 UTC (997 KB)
[v3] Fri, 30 Jan 2026 06:54:26 UTC (1,662 KB)
[v4] Wed, 13 May 2026 07:57:37 UTC (1,654 KB)
