PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

arXiv:2603.05869v1 Announce Type: new

Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization,