System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

arXiv:2601.12430v2 Announce Type: replace

Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image, and text. However, existing mitigation strategies tend to interpret these imbalances in image-centric terms, often prioritising increased image attention while giving less consideration to the roles of the other modalities.
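
To make "allocation of attention across input modalities" concrete, the following minimal sketch computes the share of one attention row that falls on system, image, and text tokens. It is an illustration only, not the paper's method: the function name `modality_attention_shares`, the token spans, and the attention values are all hypothetical toy inputs chosen for the example.

```python
from typing import Dict, List, Tuple


def modality_attention_shares(
    attn_row: List[float],
    spans: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Return the fraction of attention mass that lands on each modality.

    attn_row: attention weights from one query token to every input token.
    spans: hypothetical mapping from modality name to a half-open
           [start, end) range of token positions.
    """
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    return {
        name: sum(attn_row[start:end]) / total
        for name, (start, end) in spans.items()
    }


# Toy values, assumed for illustration only (not measurements from any model).
attn_row = [0.40, 0.20, 0.05, 0.05, 0.10, 0.10, 0.05, 0.05]
spans = {"system": (0, 2), "image": (2, 6), "text": (6, 8)}

shares = modality_attention_shares(attn_row, spans)
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

In this toy row the two system tokens receive more attention mass than the four image tokens combined. An image-centric reading would treat that only as too little image attention and would leave the system share unexamined.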