AI RESEARCH
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
arXiv CS.CL
arXiv:2605.06650v1 Announce Type: new Reinforcement learning with verifiable rewards (RLVR), thanks to its deterministic verification, has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has seen a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces PPO's complicated advantage estimation with a simple estimate computed over a group of positive and negative rollouts.
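To make the "simple estimate over grouped rollouts" concrete, the following is a minimal sketch of a group-relative advantage as it is commonly described for GRPO: each rollout's reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for illustration; they are not taken from the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per rollout, normalized within the group.

    Hypothetical helper: centers each reward on the group mean and
    divides by the group's population standard deviation (an assumed
    choice), with eps guarding against division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifiable rewards for four rollouts of the same prompt:
# 1.0 = verified correct (positive), 0.0 = incorrect (negative).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every rollout gets the same reward carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this sketch, positive rollouts receive positive advantages and negative rollouts receive negative ones, while a group with identical rewards yields all-zero advantages. No learned value function is involved, which is the simplification relative to PPO that the abstract refers to.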