DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

arXiv:2512.12633v2 Announce Type: replace-cross Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we