Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

arXiv:2510.23049v3 Announce Type: replace-cross

This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that modify GRPO directly. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. In particular, we interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization.
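
For readers unfamiliar with the two quantities the abstract contrasts, the following is a minimal sketch, not the note's own method. It assumes binary verifiable rewards, uses the standard combinatorial estimator of Pass@K from n samples with c correct, and uses a group-normalized advantage of the kind commonly associated with GRPO (mean-centered, divided by the population standard deviation). The function names `pass_at_k` and `grpo_advantages` and the `eps` parameter are hypothetical, chosen only for illustration.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled responses, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly drawn size-k subset of the n samples contains no correct one.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation with a small eps for stability;
    actual GRPO implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.0]  # one correct response out of four
    print(pass_at_k(n=4, c=1, k=2))
    print(grpo_advantages(rewards))
```

In this sketch a hard prompt (few correct samples) yields a large positive advantage for the rare correct response, which is the quantity that advantage-shaping methods rescale.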