MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

ArXi:2511.18373v2 Announce Type: replace Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-related reasoning involving motion dynamics and spatial interactions. We present a novel approach to address this gap by translating physical-world context cues into interpretable representations aligned with VLM perception, comprehension, and reasoning. We