LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

arXiv:2604.23950v1 Announce Type: new Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual input sequences. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs).
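
As a rough illustration of the attention-based importance scoring described above, the following minimal sketch scores each visual token by the attention weight it receives from a single query and keeps the top-scoring fraction. It is a generic baseline under stated assumptions, not the method proposed in this paper: the function names, the single-query setup, the `keep_ratio` parameter, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(query: Sequence[float],
                     keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-scoring tokens, returned in original order."""
    # Always keep at least one token (hypothetical design choice).
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data (assumed): one query vector and four visual-token key vectors.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [3.0, 0.0], [-1.0, 0.0]]

scores = attention_scores(query, keys)
kept = select_tokens(scores, keep_ratio=0.5)
print(kept)  # indices of the visual tokens that survive pruning
```

In a real VLM the scores would typically come from an attention map inside the vision encoder or the LLM, often averaged over heads or queries; the single-query form here is only the simplest case that shows the ranking step.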