AI RESEARCH
Imperfect World Models are Exploitable
arXiv cs.AI
arXiv:2605.15960v1 Announce Type: new

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We draw an analogy between our definition and a prior characterization of reward hacking, but show that the associated proof of inevitability does not transfer to exploitation.
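To make the informal definition concrete, here is a minimal sketch that checks it on a toy tabular MDP. It is an illustration under our own assumptions and does not reproduce the paper's formal definition. We assume a finite MDP whose rewards are known and shared by the world model and the true environment, deterministic policies, a fixed start-state distribution, and a discount factor `gamma`. We read "preferred" as "has higher expected discounted return" and "the reverse" as a strict preference in the opposite direction. The names `policy_value`, `is_exploitable_pair`, and `is_exploitable` are hypothetical, as is the three-state example. The check also searches only a given candidate set of policies.

```python
from typing import Sequence

# Hypothetical tabular encoding (an assumption for illustration, not the paper's notation):
#   model[s][a][s2] = probability of reaching state s2 after taking action a in state s
#   rewards[s][a]   = expected immediate reward for action a in state s (shared by both models)
#   policy[s]       = action a deterministic policy takes in state s
Model = Sequence[Sequence[Sequence[float]]]
Rewards = Sequence[Sequence[float]]
Policy = Sequence[int]


def policy_value(model: Model, rewards: Rewards, policy: Policy,
                 start: Sequence[float], gamma: float,
                 tol: float = 1e-12, max_iters: int = 100_000) -> float:
    """Expected discounted return of `policy` from the start distribution under `model`,
    computed by iterative policy evaluation (repeated Bellman backups)."""
    v = [0.0] * len(rewards)
    for _ in range(max_iters):
        new_v = [
            rewards[s][policy[s]]
            + gamma * sum(p * v[s2] for s2, p in enumerate(model[s][policy[s]]))
            for s in range(len(rewards))
        ]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return sum(mu * val for mu, val in zip(start, v))


def is_exploitable_pair(world_model: Model, true_model: Model, rewards: Rewards,
                        pi_1: Policy, pi_2: Policy,
                        start: Sequence[float], gamma: float) -> bool:
    """True if the world model strictly prefers pi_1 over pi_2 while the true
    transition model strictly prefers pi_2 over pi_1 (our reading of 'the reverse')."""
    model_1 = policy_value(world_model, rewards, pi_1, start, gamma)
    model_2 = policy_value(world_model, rewards, pi_2, start, gamma)
    true_1 = policy_value(true_model, rewards, pi_1, start, gamma)
    true_2 = policy_value(true_model, rewards, pi_2, start, gamma)
    return model_1 > model_2 and true_2 > true_1


def is_exploitable(world_model: Model, true_model: Model, rewards: Rewards,
                   policies: Sequence[Policy], start: Sequence[float], gamma: float) -> bool:
    """True if some ordered pair of candidate policies exhibits the preference reversal."""
    return any(
        is_exploitable_pair(world_model, true_model, rewards, p1, p2, start, gamma)
        for p1 in policies
        for p2 in policies
    )


# Toy 3-state MDP: 0 = start, 1 = "good" absorbing state (reward 1 per step), 2 = "bad" absorbing state.
rewards = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
absorb_good = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
absorb_bad = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
true_model = [[[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]], absorb_good, absorb_bad]
world_model = [[[0.0, 0.1, 0.9], [0.0, 0.9, 0.1]], absorb_good, absorb_bad]  # misjudges state 0
start, gamma = [1.0, 0.0, 0.0], 0.9
pi_a, pi_b = [0, 0, 0], [1, 0, 0]  # the policies differ only in the action taken in state 0

print(is_exploitable_pair(world_model, true_model, rewards, pi_b, pi_a, start, gamma))  # True
```

In this toy example the world model rates `pi_b` at about 8.1 and `pi_a` at about 0.9, while the true model rates `pi_a` at about 8.1 and `pi_b` at about 1.8. The preference flips, so under the assumptions above this world model is exploitable. Iterative evaluation is used instead of a direct linear solve only to keep the sketch free of third-party dependencies.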