VisionZip: Longer is Better but Not Necessary in Vision Language Models

arXiv:2412.04467v2 Announce Type: replace-cross

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we