Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

arXiv:2510.18924v3

Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We