LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning

arXiv:2506.09935v3 Announce Type: replace Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding research goal. Despite recent progress, 3D VLMs still struggle with spatial reasoning and robustness. We identify three key obstacles hindering their progress: (1) scene representation is constrained by a capacity-efficiency trade-off, which impedes scalable learning; (2)