EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

arXiv:2605.04960v1 Announce Type: cross Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity, which ignores the heterogeneous informational value of tokens; uniform polarity, which penalizes correct steps and rewards incorrect ones; and zero-variance collapse, which erases outcome-driven gradients.
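
The failures listed above can be seen in a minimal sketch of the baseline group-relative advantage that GRPO is commonly described as using. This illustrates the baseline only, not the EP-GRPO method. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Copy each response-level advantage onto every token of that response."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Mixed group: two correct and two incorrect responses.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform granularity and polarity: every token in a response shares one value
# and one sign, regardless of whether that particular step was correct.
token_adv = broadcast_to_tokens(mixed, [3, 2, 4, 1])

# Zero-variance collapse: identical rewards give all-zero advantages,
# so the group contributes no outcome-driven gradient.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this sketch, `collapsed` holds only zeros, while each inner list of `token_adv` repeats a single value across all tokens of its response.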