Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds

arXiv:2602.00807v2 Announce Type: replace Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations.
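
The abstract does not say how visual input is lifted into point clouds. As a rough illustration only, one common way to do this is to back-project a depth map through a pinhole camera model. The sketch below assumes a per-pixel depth map and known intrinsics; the function name `depth_to_point_cloud` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points.

    Assumptions (not from the paper): `depth` is a list of rows holding
    depth along the optical axis, and (fx, fy, cx, cy) are pinhole
    intrinsics in pixel units. Non-positive depths are treated as invalid.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):      # v: pixel row index
        for u, z in enumerate(row):      # u: pixel column index
            if z <= 0:
                continue                 # skip pixels with no valid depth
            x = (u - cx) * z / fx        # inverse of u = fx * x / z + cx
            y = (v - cy) * z / fy        # inverse of v = fy * y / z + cy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with one invalid pixel.
example_depth = [[1.0, 2.0], [0.0, 4.0]]
example_cloud = depth_to_point_cloud(example_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each valid pixel contributes one 3D point, so the resulting cloud can be passed to a point-cloud encoder alongside the original 2D image features. How the paper itself obtains and encodes the points may differ from this sketch.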