SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

arXiv:2603.27437v1

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have