DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

arXiv:2511.22521v2 Announce Type: replace-cross Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation.