A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

arXiv:2512.06547v3 Announce Type: replace-cross Decoupled PPO is a successful reinforcement learning (RL) algorithm for handling the high data staleness that arises in the asynchronous RL setting. The decoupled loss used in decoupled PPO improves the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by