When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

arXiv:2605.11559v1 Announce Type: cross Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, in which generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer.