$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

arXiv:2603.10848v1 Announce Type: cross In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has