Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

arXiv:2605.18194v1 Announce Type: cross While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM