Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

arXiv:2604.12358v1 Announce Type: new Recently, visual token pruning has been studied as a way to handle the vast number of visual tokens in Multimodal Large Language Models (MLLMs). However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary driver of this failure.