Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles, and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
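
Illustrative sketch (not from the paper): the abstract describes a token-weighted variant of GRPO but does not state the weighting rule, so the code below is only a minimal sketch under stated assumptions. It standardizes rewards within a sampled group, as GRPO does, and then applies a hypothetical per-token weight `min(1, p / tau)` that shrinks the contribution of low-probability tokens. The names `token_weights`, `weighted_surrogate`, and the threshold `tau` are assumptions made for illustration, and the ratio clipping and KL terms of full GRPO are omitted.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_weights(token_logprobs, tau=0.5):
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    For a softmax policy, the gradient of log p with respect to the logits
    grows as the sampled token's probability p shrinks, so low-probability
    tokens are used here as a proxy for "likely to cause large gradients".
    Tokens with p >= tau keep weight 1; others are scaled down to p / tau.
    """
    return [min(1.0, math.exp(lp) / tau) for lp in token_logprobs]


def weighted_surrogate(group_logprobs, rewards, tau=0.5):
    """Token-weighted policy-gradient surrogate (to minimize) for one group.

    group_logprobs: one list of per-token log-probabilities per response.
    rewards: one verifiable scalar reward per response.
    In an autodiff framework the weights would be treated as constants.
    """
    advantages = group_relative_advantages(rewards)
    total, count = 0.0, 0
    for logprobs, adv in zip(group_logprobs, advantages):
        for lp, w in zip(logprobs, token_weights(logprobs, tau)):
            total += -w * adv * lp
            count += 1
    return total / count
```

With `tau=0.5`, a token sampled at probability 0.25 contributes with half weight, while a token sampled at probability 0.9 is left unchanged. Whether GRPO-SG uses this or a different shaping function is not determinable from the abstract.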

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2511.00066 [cs.LG]
	(or arXiv:2511.00066v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2511.00066 arXiv-issued DOI via DataCite

Submission history

From: Tue Le
[v1] Wed, 29 Oct 2025 08:07:47 UTC (998 KB)
[v2] Mon, 22 Dec 2025 08:49:46 UTC (997 KB)
[v3] Fri, 30 Jan 2026 06:54:26 UTC (1,662 KB)
[v4] Wed, 13 May 2026 07:57:37 UTC (1,654 KB)