Learning Multi-View Spatial Reasoning from Cross-View Relations

arXiv:2603.27967v1 Announce Type: new Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we