Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models

arXiv:2006.04363v2 Announce Type: replace-cross

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small model errors may cause Dyna agents to fail. In this paper, we highlight that one potential cause of that failure is bootstrapping off the values of simulated states, and