Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv:2604.16256v1 Announce Type: cross Reasoning in vision-language models (VLMs) has recently attracted significant attention because of its broad applicability across diverse downstream tasks. However, it remains unclear whether the strong performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we