Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models

arXiv:2604.17274v1. Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, deploying them reliably in practice requires robust confidence estimation. Prior work has focused predominantly on text-only LLMs, often relying on computationally expensive self-consistency sampling. In this paper, we extend this line of work to multimodal settings and conduct a comprehensive evaluation of confidence estimation for MLLM responses.