Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

arXiv:2506.09522v3 Announce Type: replace-cross

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how visual information contributes to the model's decoding process remains under-explored, a gap reflected in frequent hallucinations. Through a series of analyses, we find that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints.
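
Finding (ii) can be made concrete with a toy example. The sketch below is not the authors' procedure; it only illustrates the general idea of projecting a token's hidden state onto the vocabulary and restricting the candidates to an allowed subset. The function name `constrained_decode`, the two-dimensional hidden state, the four-word vocabulary, and all weights are hypothetical values invented for illustration.

```python
def constrained_decode(hidden, unembed, vocab, allowed):
    """Project one hidden state onto the vocabulary and pick the best allowed token.

    hidden:  list[float], a single token's hidden state (dimension d)
    unembed: list[list[float]], one row of length d per vocabulary entry
    vocab:   list[str], token strings aligned with the rows of unembed
    allowed: set[str], the vocabulary constraint (hypothetical)
    """
    # The logit of each vocabulary entry is the dot product with its row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    # Keep only the indices whose token string is in the allowed subset.
    candidates = [i for i, tok in enumerate(vocab) if tok in allowed]
    if not candidates:
        raise ValueError("no vocabulary entry satisfies the constraint")
    best = max(candidates, key=lambda i: logits[i])
    return vocab[best], logits[best]


# Toy data, invented for illustration only.
vocab = ["the", ",", "dog", "grass"]
unembed = [
    [2.0, 0.0],  # "the"
    [1.5, 0.5],  # ","
    [1.0, 1.0],  # "dog"
    [0.0, 1.2],  # "grass"
]
vision_token = [1.0, 0.8]

# Without a constraint, a function word has the highest logit.
unconstrained = constrained_decode(vision_token, unembed, vocab, set(vocab))
# Restricting the vocabulary to content words surfaces a visual concept.
constrained = constrained_decode(vision_token, unembed, vocab, {"dog", "grass"})
```

In this toy setup the unconstrained choice is the function word "the", while the constrained choice is "dog": the content word was already ranked highly but was masked by higher-scoring, semantically empty tokens. Whether real vision tokens behave this way, and which constraint is appropriate, is what the paper's analyses address.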