When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

arXiv:2507.13868v2 Announce Type: replace Vision-language models (VLMs) increasingly combine visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by