ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

arXiv:2511.22715v2 Announce Type: replace-cross

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-