AI RESEARCH

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

arXiv cs.CL

arXiv:2604.13731v1 Announce Type: new

Multi-page Document Visual Question Answering (DocVQA) requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation.
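To make the phrase "sequential evidence aggregation" concrete, the following is a minimal toy sketch of a coarse-to-fine loop: rank pages cheaply, inspect the most promising ones one at a time, and stop once enough evidence has been collected. It is not the authors' implementation. Every name here (`coarse_rank`, `inspect_page`, `aggregate`, `is_sufficient`, `max_steps`) is hypothetical, and plain strings stand in for page images purely for illustration, whereas the framework described above is OCR-free and operates on visual input.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    page: int   # index of the page the note came from
    note: str   # what was found on that page


@dataclass
class AgentState:
    question: str
    evidence: List[Evidence] = field(default_factory=list)


def coarse_rank(question: str, summaries: List[str]) -> List[int]:
    """Hypothetical coarse step: order pages by word overlap between the
    question and a low-detail summary of each page (ties keep page order)."""
    q = set(question.lower().split())
    overlap = [len(q & set(s.lower().split())) for s in summaries]
    return sorted(range(len(summaries)), key=lambda i: (-overlap[i], i))


def inspect_page(question: str, page: str) -> str:
    """Hypothetical fine step: keep only the lines of one page that share
    a word with the question. A real system would examine the page image."""
    q = set(question.lower().split())
    hits = [ln for ln in page.splitlines() if q & set(ln.lower().split())]
    return " ".join(hits)


def aggregate(
    question: str,
    summaries: List[str],
    pages: List[str],
    is_sufficient: Callable[[AgentState], bool],
    max_steps: int = 3,
) -> AgentState:
    """Visit pages in coarse-rank order, accumulating evidence until the
    stopping rule fires or the step budget is exhausted."""
    state = AgentState(question)
    for page_idx in coarse_rank(question, summaries)[:max_steps]:
        note = inspect_page(question, pages[page_idx])
        if note:
            state.evidence.append(Evidence(page_idx, note))
        if is_sufficient(state):
            break
    return state


if __name__ == "__main__":
    summaries = ["cover page", "revenue table 2023", "appendix notes"]
    pages = [
        "Annual report",
        "Total revenue 2023: 10M\nCosts: 4M",
        "Notes on method",
    ]
    result = aggregate(
        "revenue 2023", summaries, pages,
        is_sufficient=lambda s: len(s.evidence) >= 1,
    )
    for ev in result.evidence:
        print(ev.page, ev.note)
```

The point of the sketch is the control flow only: the step budget keeps cost independent of document length, and the stopping rule lets the loop end as soon as the gathered evidence is judged adequate. The word-overlap scoring is a placeholder chosen so the example runs without any model.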