ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

arXiv:2509.21991v2 Announce Type: replace-cross Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain.