Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance


Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., conducting Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, even matching the performance they reach with the full dataset.
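The three components named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the decay schedule or how the losses are combined, so the cosine schedule, the additive combination, and the names `sft_weight`, `combined_loss`, and `sample_demonstrations` are all assumptions made for illustration.

```python
import math
import random


def sample_demonstrations(dataset, k=128, seed=0):
    """Randomly pick k demonstrations from an SFT dataset (k=128 as in the abstract)."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)


def sft_weight(step, total_steps, w0=1.0, w_min=0.0):
    """Hypothetical cosine decay of the SFT weight from w0 down to w_min.

    The abstract only says the weight decays; the cosine shape is an assumption.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w0 - w_min) * (1.0 + math.cos(math.pi * frac))


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """On-policy RL loss plus a decaying supervised loss on the few-shot set.

    The additive form is an assumption, not the paper's stated objective.
    """
    return rl_loss + sft_weight(step, total_steps) * sft_loss


if __name__ == "__main__":
    demos = sample_demonstrations(list(range(10_000)))
    print(len(demos))
    for step in (0, 250, 500, 750, 1000):
        print(step, round(sft_weight(step, 1000), 4))
```

Under these assumptions, the supervised term dominates early in training and fades as the small demonstration set is revisited over multiple epochs, which is one plausible way to limit overfitting to it.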

Comments:	25 pages, 11 figures
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2605.15012 [cs.LG]
	(or arXiv:2605.15012v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2605.15012 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Kai Yan
[v1] Thu, 14 May 2026 16:12:30 UTC (694 KB)