DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

ArXi:2508.13238v3 Announce Type: replace Recent advances in vision-language models (VLMs) have enabled end-to-end document parsing and understanding, achieving strong performance on diverse optical character recognition (OCR) tasks. However, VLMs are prone to generate words that do not exist in the input image due to over-reliance on language priors.