TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback

arXiv:2508.02833v4

Group Relative Policy Optimization (GRPO), recently ... Motivated by this finding, we propose Trajectory-level Importance-Corrected GRPO (TIC-GRPO), a new algorithm that replaces token-level importance ratios with a single trajectory-level probability ratio, thereby yielding an estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first convergence analysis for GRPO-style methods and show that TIC-GRPO converges faster than ...
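
To make the distinction between token-level and trajectory-level importance ratios concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's objective: the function names, the toy log-probabilities, and the use of mean/std group normalization for advantages are all hypothetical choices made for this example, and any clipping or regularization terms are omitted.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    # Assumed critic-free advantage: standardize rewards within one group
    # of sampled responses (no learned value function).
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_level_ratios(logp_new, logp_old):
    # One importance ratio per token: pi_new(token) / pi_old(token).
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def trajectory_level_ratio(logp_new, logp_old):
    # A single ratio for the whole response: the product of per-token
    # ratios, computed as exp(sum of log-prob differences) for stability.
    diff = np.asarray(logp_new) - np.asarray(logp_old)
    return float(np.exp(diff.sum()))

# Toy group of two sampled responses (hypothetical per-token probabilities).
logp_old = [np.log([0.5, 0.4, 0.2]), np.log([0.3, 0.6])]
logp_new = [np.log([0.6, 0.4, 0.3]), np.log([0.3, 0.5])]
rewards = [1.0, 0.0]

advantages = group_normalized_advantages(rewards)
token_ratios = [token_level_ratios(n, o) for n, o in zip(logp_new, logp_old)]
traj_ratios = [trajectory_level_ratio(n, o) for n, o in zip(logp_new, logp_old)]

# Trajectory-level weighting: every token of response i shares one weight,
# traj_ratios[i] * advantages[i], instead of a separate weight per token.
traj_weights = [r * a for r, a in zip(traj_ratios, advantages)]
```

In this sketch the trajectory-level ratio for each response equals the product of its token-level ratios, so each response contributes one scalar weight rather than one weight per token.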