AI RESEARCH
Post-Training with Policy Gradients: Optimality and the Base Model Barrier
arXiv cs.LG
arXiv:2603.06957v1 Announce Type: cross We study post-