VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

ArXi:2604.25235v1 Announce Type: new Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no re