StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arXiv:2604.18401v1 Announce Type: new General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly strong agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-