Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

arXiv:2604.01840v2

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks share a fundamental methodological flaw: by assigning the same advantage to every generated token, they inherently dilute the learning signal needed to optimize the critical, visually grounded steps of multimodal reasoning.
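The abstract does not say which advantage estimator the prevailing frameworks use. The sketch below assumes a GRPO-style, group-normalized estimator, which is one common RLVR choice. It illustrates only the uniform token-level credit assignment that the abstract criticizes. It is not the paper's perception-grounded method. The function names `group_relative_advantages` and `broadcast_to_tokens` are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style estimator: normalize each response's scalar
    verifiable reward against the mean/std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, response_lengths):
    """Uniform credit assignment: every token of response i receives the
    same advantage, whether it grounds a visual detail or is filler text."""
    return [[a] * n for a, n in zip(advantages, response_lengths)]

# Hypothetical group of 4 sampled responses to one image-question pair.
rewards = [1.0, 0.0, 0.0, 1.0]   # 1.0 = verifiably correct final answer
lengths = [3, 2, 4, 3]           # number of generated tokens per response

adv = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(adv, lengths)
for i, toks in enumerate(per_token):
    print(f"response {i}: {[round(a, 3) for a in toks]}")
```

Every token in a correct response gets the same positive value, and every token in an incorrect one gets the same negative value. The estimator therefore cannot single out the few visually grounded tokens that actually determined the outcome. This is the dilution the abstract refers to.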