Targeted Exploration via Unified Entropy Control for Reinforcement Learning

arXiv:2604.14646v1 Announce Type: new

Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods