Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms

arXiv:2603.09090v1 Announce Type: new

In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked