Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

arXiv:2511.16160v2

Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), enabling them to comprehend the physical world. Drawing inspiration from human perception mechanisms, prior studies have attempted to build spatial understanding through grid-based cognitive maps. However, current grid-based methods rely on discretized representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video.