VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

arXiv:2603.16289v1 Announce Type: cross

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability, and neglect of the native visual information of web pages in reasoning chains. To address these challenges, we