Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs

ArXi:2510.00705v2 Announce Type: replace Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or detecting key moments in long videos. Existing methods typically rely on complex, task-specific fine-tuning, which reduces generalizability and increases system complexity. In this work, we propose an effective