AI RESEARCH

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

arXiv cs.LG

arXiv:2605.17570v1 Announce Type: new

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: how off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL.