VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

arXiv:2604.02486v1 Announce Type: cross Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow