Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

arXiv:2601.06943v2 Announce Type: replace In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web. Models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR.