Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

arXiv:2605.15864v1 Announce Type: cross Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We