On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

arXiv:2602.12506v2

Reinforcement learning (RL) fine-tuning has become a key technique for improving large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues.