AI RESEARCH

MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

arXiv cs.CV

arXiv:2604.06376v1 Announce Type: new

Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language