Segment-Aligned Policy Optimization for Multi-Modal Reasoning

arXiv:2605.01327v1 Announce Type: cross Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable