Offline Policy Optimization with Posterior Sampling

arXiv:2605.07393v1 Announce Type: new A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also