Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPs

arXiv:2604.01024v1 Announce Type: new We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependence with finite action-observation windows. This induces a finite-state Markov decision process (MDP) over windowed histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, which motivates the need for sample-efficient model estimation.
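
To make the superstate construction concrete, the following is a minimal sketch, not the paper's estimator. It assumes a trajectory is a list of (action, observation) pairs, treats the last `window` pairs as the superstate, and estimates superstate transition probabilities by plain empirical counts. The names `window_transitions` and `estimate_superstate_model` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict, deque


def window_transitions(trajectory, window):
    """Convert one trajectory into superstate transitions.

    trajectory: list of (action, observation) pairs (assumed format).
    A superstate is the tuple of the most recent `window` pairs;
    the initial superstate is the empty tuple.
    """
    hist = deque(maxlen=window)  # oldest pair is dropped automatically
    transitions = []
    for action, obs in trajectory:
        s = tuple(hist)
        hist.append((action, obs))
        transitions.append((s, action, tuple(hist)))
    return transitions


def estimate_superstate_model(trajectories, window):
    """Empirical-count estimate of P(next superstate | superstate, action).

    This is a plain frequency estimate, used here as an assumption;
    the source does not specify its estimator at this level of detail.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, a, s_next in window_transitions(traj, window):
            counts[(s, a)][s_next] += 1
    model = {}
    for key, successors in counts.items():
        total = sum(successors.values())
        model[key] = {s2: c / total for s2, c in successors.items()}
    return model


# Two short made-up trajectories with window length 1.
trajectories = [
    [(0, "x"), (0, "x")],
    [(0, "x"), (0, "y")],
]
model = estimate_superstate_model(trajectories, window=1)
```

Here `model` maps each (superstate, action) pair to a distribution over next superstates. Once such a table is available, any standard tabular MDP solver (for example, value iteration) could in principle be run on it.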