AI RESEARCH
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
arXiv cs.LG
arXiv:2511.00066v4 Announce Type: replace
Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improving large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR