MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

arXiv:2512.02906v3 Announce Type: replace-cross Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving the understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also