MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

arXiv:2604.03072v1. In multimodal large language models (MLLMs), visual information is relatively sparse compared with text, which has motivated research on visual token pruning for efficient inference. Current approaches typically measure token importance using attention scores from the visual encoder or the LLM decoder, then keep the visual tokens with high attention scores and prune the rest. In this paper, we pursue a different, more surgical approach.
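
For context, the attention-based baseline that the abstract contrasts against can be sketched roughly as follows. This is a minimal illustration of generic top-k selection by attention score, not the method proposed in this paper. The function name `attention_topk_prune`, the averaging over query rows, and the toy attention values are all assumptions made for the example.

```python
from typing import List, Sequence


def attention_topk_prune(
    attn_to_visual: Sequence[Sequence[float]], keep: int
) -> List[int]:
    """Return indices of the `keep` visual tokens with the highest mean attention.

    attn_to_visual[i][j] is the (hypothetical) attention weight that query
    token i assigns to visual token j.
    """
    if not attn_to_visual:
        raise ValueError("attention matrix must have at least one query row")
    num_tokens = len(attn_to_visual[0])
    if not 0 < keep <= num_tokens:
        raise ValueError("keep must be between 1 and the number of visual tokens")

    # Importance score: mean attention each visual token receives over all queries.
    num_queries = len(attn_to_visual)
    scores = [
        sum(row[j] for row in attn_to_visual) / num_queries
        for j in range(num_tokens)
    ]

    # Select the highest-scoring tokens, then restore their original order
    # so the kept tokens stay in sequence position order.
    top = sorted(range(num_tokens), key=lambda j: scores[j], reverse=True)[:keep]
    return sorted(top)


# Toy example: 2 query tokens attending to 4 visual tokens.
attn = [
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.4, 0.1, 0.3],
]
kept = attention_topk_prune(attn, keep=2)
print(kept)  # [1, 3]
```

Returning indices rather than the tokens themselves keeps the sketch independent of any particular tensor library; in practice the indices would be used to gather the retained visual embeddings.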