Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

arXiv:2605.01356v1 Announce Type: new Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples.