Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

arXiv:2604.15809v1 Announce Type: new Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often attend to the correct image region for a given question, they do not necessarily produce the correct answer. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers.
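
To make the notion of "too much attention to irrelevant visual tokens" concrete, the following is a minimal illustrative sketch, not the paper's method. It assumes a single attention row for one text token, a known set of visual-token positions, and a known subset of those positions covering the relevant image region. The function name irrelevant_attention_share and the toy logits are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def irrelevant_attention_share(attn_row, visual_idx, relevant_idx):
    """Fraction of a text token's visual attention mass that falls
    outside the relevant region (hypothetical diagnostic).

    attn_row     : attention weights of one text token over all key positions
    visual_idx   : key positions that hold visual tokens
    relevant_idx : subset of visual_idx covering the question-relevant region
    """
    relevant = set(relevant_idx)
    visual_mass = sum(attn_row[i] for i in visual_idx)
    if visual_mass == 0.0:
        return 0.0  # the token does not attend to any visual token
    irrelevant_mass = sum(attn_row[i] for i in visual_idx if i not in relevant)
    return irrelevant_mass / visual_mass


# Toy example (assumed values): positions 0-3 are visual tokens,
# positions 4-5 are text tokens; only position 0 is relevant.
logits = [2.0, 0.5, 0.5, 0.5, 1.0, 1.0]
attn = softmax(logits)
share = irrelevant_attention_share(attn, visual_idx=[0, 1, 2, 3], relevant_idx=[0])
print(f"share of visual attention on irrelevant tokens: {share:.3f}")
```

In this toy setting the relevant token receives the single largest weight, yet a substantial part of the visual attention mass is still spread over the other visual tokens. This is the kind of pattern the abstract describes: the model locates the right region while much of its attention goes elsewhere.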