HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

arXiv:2508.00553v3 Announce Type: replace Vision-Language Models (VLMs) encode images and videos into large numbers of tokens, which carry substantial redundancy and incur high computational cost. While visual token pruning mitigates this issue, most existing methods lack insight into the intrinsic properties of the vision encoder itself. In this work, we examine the vision encoder and demonstrate, both qualitatively and quantitatively, that the middle layers attend to the main objects of the image, while the deep layers attend to tokens with rich global information.