Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

arXiv:2603.23889v1 Announce Type: new When safety is formulated as a limit on cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return while satisfying the cost constraint during both data collection and deployment. Off-policy safe RL methods offer high sample efficiency but suffer from constraint violations caused by cost-agnostic exploration and by estimation bias in the cumulative cost.
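
The cost-constrained objective described above is commonly written as a constrained Markov decision process. The form below is a standard textbook sketch, not necessarily the paper's exact formulation: the symbols $\gamma$ (discount factor), $r$ and $c$ (per-step reward and cost), and $d$ (cost limit) are assumptions introduced here for illustration, and the abstract does not state whether the cumulative cost is discounted.

```latex
% Assumed notation: \pi policy, \tau trajectory, \gamma discount factor,
% r per-step reward, c per-step cost, d cumulative-cost limit.
\begin{aligned}
\max_{\pi}\quad & J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \\
\text{s.t.}\quad & J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d
\end{aligned}
```

Under this reading, the two failure modes named in the abstract map onto the constraint term: cost-agnostic exploration collects data with policies for which $J_c(\pi) > d$, and a biased estimate of $J_c(\pi)$ can make a violating policy appear feasible.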