Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

arXiv:2605.12517v1 Announce Type: cross Vision-language models (VLMs) are often deployed on text-only inputs, even though they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and that the model does not behave like its original language backbone under text-only prompting. This failure is not explained by missing semantic information alone. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration.
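
To make "miscalibration" concrete: the abstract does not say which calibration metric is used, so the sketch below assumes expected calibration error (ECE), a common choice. It bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. The function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE with equal-width confidence bins (assumed metric).

    confidences: predicted probability of the chosen answer, in [0, 1]
    correct:     1 if the chosen answer was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges, e.g. 0.1, 0.2, ..., 0.9 when n_bins=10.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bins are (lower, upper]; a confidence of 1.0 lands in the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between empirical accuracy and mean confidence in this bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            # Weight the gap by the fraction of samples in the bin.
            ece += mask.mean() * gap
    return ece


# An overconfident model: 90% confidence but only half the answers right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 1])
```

Under this metric, a model that is right about as often as its confidence suggests scores near 0, while the overconfident example above scores about 0.4. Comparing such a score for the same model with and without image inputs is one way the described gap could be quantified.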