$V_0$: A Generalist Value Model for Any Policy at State Zero

arXiv:2602.03584v2 Announce Type: replace-cross Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the