Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

arXiv:2507.00748v3 Announce Type: replace Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multimodal instructions. To address this, we adopt a reinforcement learning (RL) based post-