AI RESEARCH
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
arXiv CS.CV
arXiv:2605.13034v1 Announce Type: new
Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate their own charts, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures.