AI RESEARCH
Point What You Mean: Visually Grounded Instruction Policy
arXiv CS.CV
arXiv:2512.18933v2 Announce Type: replace Vision-Language-Action (VLA) models align vision and language with embodied control, but their object-referring ability remains limited when they rely solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we