Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

arXiv:2605.13632v1 Announce Type: cross
In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Although existing VLA models can be effective within their training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur.