Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

arXiv:2605.04874v1

Direct Preference Optimization (DPO) has proven effective at mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. A key challenge is how to translate sequence-level preferences into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate