Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

arXiv:2605.00347v1 Announce Type: cross Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20--30 turns). In this work, we study RL-based