AI RESEARCH

Do Vision Language Models Need to Process Image Tokens?

arXiv CS.CV

arXiv:2604.09425v1 Announce Type: new Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). VLMs process dense image tokens across deep transformer stacks, incurring substantial computational overhead, yet it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or whether visual representations meaningfully evolve from early to later layers.