All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

arXiv:2604.00479v1 Announce Type: new Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). Despite this promise, the mechanisms that drive the effectiveness of RL-trained models, as well as their limitations, remain underexplored.