A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

arXiv:2605.06200v1 Announce Type: new Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that