Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

arXiv:2602.22918v2 Announce Type: replace Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions.