SketchVLM: Vision language models can annotate images to explain thoughts and guide users

arXiv:2604.22875v1 Announce Type: new When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 respond only with text, which can be difficult for users to verify. We present SketchVLM, a