Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

arXiv:2505.07527v5 Announce Type: replace The advantage function is a central concept in reinforcement learning (RL) that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group sizes and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings.
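
To make the group-mean baseline concrete, here is a minimal sketch of how group-relative advantages might be computed for the rollouts sampled from one prompt. It is an illustration under stated assumptions, not the paper's implementation: the function name `group_relative_advantages`, the `normalize_std` flag, and the `eps` constant are hypothetical, and dividing by the within-group standard deviation is a commonly used variant that the abstract itself does not spell out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-8):
    """Center each rollout reward on the within-group sample mean.

    rewards: scalar rewards for the rollouts sampled from a single prompt.
    normalize_std: if True, also divide by the within-group standard
        deviation (an assumed variant, not confirmed by the abstract).
    eps: small constant (assumed) that avoids division by zero when all
        rewards in the group are identical.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one rollout")

    # The baseline is the sample mean of the group itself.
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered

    # Population standard deviation of the same group.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


if __name__ == "__main__":
    # Four rollouts for one prompt, two rewarded and two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0], normalize_std=False))
    # The same correct rollout judged against a different two-rollout group.
    print(group_relative_advantages([1.0, 1.0], normalize_std=False))
```

In the second call both rollouts earn the full reward, so the baseline equals the reward and every advantage is zero, whereas a rollout with reward 1.0 in the first group receives an advantage of 0.5. With only a handful of rollouts, the baseline depends heavily on which samples happened to be drawn, which is the sensitivity to small group sizes that the abstract describes.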