AI RESEARCH
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
arXiv cs.AI
arXiv:2605.07331v1 Announce Type: cross Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-