Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

arXiv:2601.22060v3

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, because MLLMs are constrained by the capacity of their internal world knowledge, prior work has proposed augmenting them with a "reasoning-then-tool-call" approach for visual and textual search engines, which yields substantial gains on tasks requiring extensive factual information.