Optimistic Policy Regularization

arXiv:2603.06793v1

Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We