From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

arXiv:2505.08548v3 Announce Type: replace-cross Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on general Vision-Language Models (VLMs), still fall short of robust zero-shot performance because embodied datasets are scarce and heterogeneous.