VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

arXiv:2601.14440v2 Announce Type: replace Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we