Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

arXiv:2605.08064v1 Announce Type: new Spatial intelligence in vision-language models (VLMs) has attracted growing research interest, driven by the practical demand for reasoning in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline of VLMs and use pixel-aligned representations for the vision modality. As a result, correspondence-based models, which rely on implicit 3D scene understanding, often fail to achieve spatial consistency, while representation-based models, which use 3D geometric priors, serialize the vision sequence inefficiently.