CARES: Context-Aware Resolution Selector for VLMs

arXiv:2510.19496v2 Announce Type: replace-cross Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates the visual token count, often to 97-99% of total tokens, resulting in high compute and latency even when low-resolution images would suffice. We