Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

arXiv:2603.16932v1 Announce Type: new

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but can miss critical visual information, such as small text.