Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

arXiv:2511.20162v2 Announce Type: replace Large multi-modal models (LMMs) perform increasingly well on realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models can describe in detail the objects, the surroundings, and the dynamic actions it contains. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we.