Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

arXiv:2603.23404v1 Announce Type: cross Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we