AI RESEARCH

Post-Training with Policy Gradients: Optimality and the Base Model Barrier

arXiv cs.LG

arXiv:2603.06957v1 Announce Type: cross We study post-