Vision-language models lag human performance on physical dynamics and intent reasoning

arXiv:2601.01547v2 Announce Type: replace-cross Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the latent human intentions that drive spatial change. We