DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

arXiv:2509.25866v2 Announce Type: replace

The "thinking with images" paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning.