VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

arXiv:2504.11944v3

Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially