Visually-Guided Policy Optimization for Multimodal Reasoning

arXiv:2604.09349v1 Announce Type: cross Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherently text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation on visual tokens. Importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency.