Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

arXiv:2511.10946v3

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D