Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering

arXiv:2605.01827v1 Announce Type: new Large Multimodal Models (LMMs) have recently demonstrated their proficiency in holistic visual comprehension. However, most of them struggle with region-level perception guided by visual prompts, especially when multiple regions are referred to simultaneously or when global context is necessary for precise visual referring. We