VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

arXiv:2505.20279v3 Announce Type: replace-cross The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition.