Structured Role-Aware Policy Optimization for Multimodal Reasoning

arXiv:2605.07274v1

Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence.
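
To make the sequence-level credit assignment concrete, the following is a minimal sketch of the standard GRPO group-relative advantage, not the method proposed in this paper. Each sampled response receives one scalar reward, the reward is normalized against the group, and the resulting advantage is copied to every token of that response. The function names, the example rewards, the token counts, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's scalar reward against its sampling group.

    Assumption: population standard deviation with a small eps for stability;
    actual implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Copy each sequence-level advantage to every token of its response."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four sampled responses to one image-question pair.
rewards = [1.0, 0.0, 0.0, 1.0]   # 1.0 = final answer verified correct
token_counts = [5, 3, 4, 2]      # response lengths in tokens (made up)

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)

for adv, toks in zip(advantages, per_token):
    # Every token in a response shares the same advantage, whether it
    # describes the image, performs reasoning, or states the answer.
    print(f"advantage={adv:+.3f} tokens={len(toks)}")
```

Because the advantage is constant within a response, this baseline cannot tell a token that grounds the answer in the image apart from one that does not, which is the limitation the abstract describes.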