VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

arXiv:2603.23481v1 Announce Type: cross Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone.