Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

arXiv:2604.00161v1

Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchoring, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we