GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

arXiv:2512.13043v2 Announce Type: replace-cross

Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods, such as Guided Thought Reinforcement (GTR) and On-Policy Distillation, densify the reward by querying a teacher that provides step-level feedback, but they rely on costly, often privileged models as the teacher, which limits practicality and reproducibility. We