From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Improvement

arXiv:2502.05145v5 Announce Type: replace This paper addresses the poor finite-horizon performance of existing online \emph{restless bandit} (RB) algorithms, which stems from the prohibitive sample complexity of learning a full \emph{Markov decision process} (MDP) for each agent. We argue that superior finite-horizon performance requires \emph{rapid convergence} to a \emph{high-quality} policy. Thus motivated, we