Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

arXiv:2604.09781v1 Announce Type: new

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a goal 6D pose of a target object in a 3D scene that is consistent with the given text. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains.