Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability

arXiv:2604.17217v1

Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence, a phenomenon termed "text shortcut learning." We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images.
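
To make the Drop measurement concrete, here is a minimal sketch under stated assumptions. The abstract does not give an exact formula, so the code assumes Drop is the accuracy with the original (consistent) text minus the accuracy with semantically conflicting text, with the images and the image-grounded labels held fixed. The function names `accuracy` and `accuracy_drop` and the toy predictions are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_drop(
    clean_preds: Sequence[str],
    conflict_preds: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Assumed definition: Drop = acc(consistent text) - acc(conflicting text).

    The labels describe what the image shows, and the images are identical
    in both conditions; only the accompanying text changes.
    """
    return accuracy(clean_preds, labels) - accuracy(conflict_preds, labels)


# Toy data (hypothetical): labels reflect the image content.
labels = ["cat", "dog", "car", "tree"]
clean_preds = ["cat", "dog", "car", "bird"]      # 3 of 4 correct
conflict_preds = ["cat", "cat", "bus", "bird"]   # 1 of 4 correct

drop = accuracy_drop(clean_preds, conflict_preds, labels)
print(f"Drop = {drop:.2f}")  # prints "Drop = 0.50"
```

Under this assumed definition, a Drop near zero would indicate that predictions follow the image even when the text disagrees, while a large Drop would be consistent with the text shortcut behavior described above.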