Efficient Inference of Large Vision Language Models

arXiv:2603.27960v1 Announce Type: new Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the large number of visual tokens produced by high-resolution inputs compounds the problem, because the cost of attention grows quadratically with sequence length. To address these issues, the research community has developed several optimization frameworks.
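
To make the quadratic growth concrete, the following minimal sketch counts visual tokens and attention score entries at a few image resolutions. It assumes a ViT-style encoder that splits the image into non-overlapping 14x14 patches and emits one token per patch. The patch size, the resolutions, and the function names are illustrative assumptions, not details taken from the paper.

```python
def num_visual_tokens(height: int, width: int, patch: int = 14) -> int:
    """Count patch tokens for an image, assuming non-overlapping square patches."""
    return (height // patch) * (width // patch)


def attention_score_entries(num_tokens: int) -> int:
    """Entries in one self-attention score matrix (each token attends to every token)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Hypothetical input resolutions; each step doubles the image side length.
    for side in (336, 672, 1344):
        n = num_visual_tokens(side, side)
        entries = attention_score_entries(n)
        print(f"{side}x{side}: {n} tokens, {entries} score entries per head per layer")
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies the size of each attention score matrix by sixteen, which is why reducing the number of visual tokens is a natural target for optimization.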