Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

ArXi:2604.00528v1 Announce Type: cross 3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching.