AI RESEARCH
Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
arXiv CS.LG
arXiv:2605.05566v1 Announce Type: cross Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective
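The zero-advantage collapse described above can be illustrated with a small sketch. It assumes the commonly used group-normalized form of the advantage, (reward minus group mean) divided by group standard deviation, with binary verifiable rewards. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumed form: (r_i - mean(r)) / (std(r) + eps). The eps term is a
    hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout out of four passes the verifier: the group carries a signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: each reward equals the group mean, so every
# advantage is exactly zero and the query contributes no gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:     ", mixed)
print("all-failed group:", all_failed)
```

In the mixed group the successful rollout receives a positive advantage and the failures receive negative ones. In the all-failed group every advantage is zero, which is the situation the abstract refers to.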