Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

arXiv:2511.20032v2

Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a