Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

arXiv:2603.01549v2 Announce Type: replace-cross Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we