CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

arXiv:2603.10101v1 Announce Type: cross Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps
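
To make the limitation concrete, here is a minimal sketch of an outcome-only verifiable reward. It is an illustration under stated assumptions, not the paper's implementation: the `Answer:` marker, the helper names `extract_final_answer` and `outcome_reward`, and the toy responses are all hypothetical. The reward checks only the final answer, so a response with a flawed intermediate step scores the same as a sound one.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def outcome_reward(response: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0.

    Intermediate reasoning steps are never inspected.
    """
    return 1.0 if extract_final_answer(response) == reference else 0.0


# Three toy responses to "compute (2 + 3) * 4" (hypothetical data).
sound = "2 + 3 = 5, and 5 * 4 = 20.\nAnswer: 20"
flawed = "2 + 3 = 6, and 6 * 4 = 20.\nAnswer: 20"  # wrong step, right answer
wrong = "2 + 3 = 5, and 5 * 4 = 25.\nAnswer: 25"

rewards = [outcome_reward(r, "20") for r in (sound, flawed, wrong)]
print(rewards)
```

The sound and the flawed responses receive identical rewards, which is the gap in step-level supervision that the abstract describes.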