Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

arXiv:2605.10293v1 Announce Type: cross

In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe.
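
The SPI performance guarantee can be written as a probabilistic inequality. The statement below is a sketch in notation common in the SPI literature, not necessarily the notation of this paper. All symbols are assumptions introduced for illustration: $\pi_b$ is the baseline policy, $\pi$ the new policy, $\rho(\cdot)$ the expected return in the true environment, $\mathcal{D}$ the fixed dataset, $\delta \in (0,1)$ the allowed failure probability, and $\zeta \ge 0$ an admissible performance loss (with $\zeta = 0$ corresponding to the new policy being at least as good as the baseline).

```latex
% Sketch of an SPI-style guarantee (assumed notation, see lead-in):
% with probability at least 1 - delta over the dataset D,
% the new policy pi is no worse than the baseline pi_b up to zeta.
\Pr_{\mathcal{D}}\!\left[\, \rho(\pi) \;\ge\; \rho(\pi_b) - \zeta \,\right] \;\ge\; 1 - \delta
```

Note that this inequality concerns performance only. Safety of the new policy is a separate requirement and is not implied by it.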