SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

arXiv:2510.16416v4

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms.