Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

arXiv:2605.10893v2 Announce Type: replace Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, because they observe model behavior only under normal inference and have no mechanism to determine whether a prediction was shaped by the image or by text alone. We