AI RESEARCH
Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models
arXiv CS.LG
arXiv:2601.22478v3 Announce Type: replace
Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When