Point What You Mean: Visually Grounded Instruction Policy

arXiv:2512.18933v2 Announce Type: replace Vision-Language-Action (VLA) models align vision and language with embodied control, but their object-referring ability remains limited when they rely solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we