Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

arXiv:2512.04341v3 Announce Type: replace

Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bayesian perspective for test-time adaptation. By modeling a posterior over world models and