AI RESEARCH
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
arXiv CS.AI
arXiv:2602.20197v2 Announce Type: replace-cross Reinforcement learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL