AI RESEARCH

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

arXiv cs.LG

arXiv:2605.14558v1 Announce Type: new Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level