Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

arXiv:2603.06009v1 Announce Type: new Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the