Interpreting Reinforcement Learning Agents with Susceptibilities

arXiv:2605.08007v1 Announce Type: new Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that, nevertheless, exhibits non-trivial stagewise development.
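
As a rough illustration of what "response of posterior expectation values to perturbations of the loss" can mean, the sketch below works through the standard linear-response identity. It assumes a tempered Gibbs-form posterior; the symbols n (sample size), \beta (inverse temperature), \varphi (prior), h (perturbation strength), and \Delta L (perturbation direction) are assumptions introduced here for illustration, and the paper's own definition may differ in sign or normalization.

```latex
% Assumed form of the perturbed posterior expectation of an observable O:
\[
  \langle \mathcal{O} \rangle_h
  = \frac{\int \mathcal{O}(w)\, e^{-n\beta\,[L(w) + h\,\Delta L(w)]}\, \varphi(w)\, dw}
         {\int e^{-n\beta\,[L(w) + h\,\Delta L(w)]}\, \varphi(w)\, dw}.
\]
% Differentiating numerator and denominator with respect to h at h = 0:
\[
  \chi
  := \left.\frac{\partial}{\partial h} \langle \mathcal{O} \rangle_h \right|_{h=0}
  = -n\beta \left( \langle \mathcal{O}\,\Delta L \rangle
                   - \langle \mathcal{O} \rangle \langle \Delta L \rangle \right)
  = -n\beta\, \mathrm{Cov}\!\left(\mathcal{O}, \Delta L\right),
\]
% where the unsubscripted expectations are taken under the unperturbed posterior (h = 0).
```

Under these assumptions the susceptibility reduces to a covariance under the unperturbed posterior, so it could in principle be estimated from posterior samples without refitting the model for each perturbation. In the reinforcement learning setting described above, the regret would take the place of L.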