Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

arXiv:2503.14075v3 Announce Type: replace-cross Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overhead poses a significant challenge for practical deployment. Recent work accelerates VLMs by pruning redundant visual tokens, guided by the attention maps of the VLM's early layers.
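
The attention-guided pruning described above can be illustrated with a minimal sketch. It is not the implementation of any particular method: the function names `select_visual_tokens` and `prune`, the layout of the `attention` matrix (one row per query token, one column per visual token, already averaged over heads), and the mean-then-top-k scoring rule are all assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_visual_tokens(
    attention: Sequence[Sequence[float]],
    keep: int,
) -> List[int]:
    """Return indices of the `keep` visual tokens with the highest mean attention.

    Assumption: attention[q][v] is the weight that query token q assigns to
    visual token v in some early layer, already averaged over heads.
    """
    if not attention or not attention[0]:
        return []
    num_visual = len(attention[0])
    keep = max(0, min(keep, num_visual))
    # Average the attention each visual token receives across all query tokens.
    scores = [
        sum(row[v] for row in attention) / len(attention)
        for v in range(num_visual)
    ]
    # Rank by score, highest first; ties are broken by the lower index.
    ranked = sorted(range(num_visual), key=lambda v: (-scores[v], v))
    # Return the survivors in their original order to preserve token positions.
    return sorted(ranked[:keep])


def prune(tokens: Sequence[T], kept_indices: Sequence[int]) -> List[T]:
    """Keep only the tokens at the selected indices."""
    return [tokens[i] for i in kept_indices]


if __name__ == "__main__":
    # Toy example: 2 query tokens attending over 4 visual tokens.
    attn = [
        [0.1, 0.5, 0.1, 0.3],
        [0.2, 0.4, 0.1, 0.3],
    ]
    kept = select_visual_tokens(attn, keep=2)
    print(kept, prune(["v0", "v1", "v2", "v3"], kept))
```

In this sketch the pruned sequence keeps the original ordering of the surviving tokens, so any positional information attached to them stays consistent; a real system would apply the selection to hidden states inside the model rather than to a plain list.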