Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

arXiv:2601.11109v3 Announce Type: replace-cross
Vision-as-inverse-graphics, the task of reconstructing images as editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we