Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

arXiv:2603.22815v1 Announce Type: cross Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead.