EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

arXiv:2603.17808v2 Announce Type: replace-cross

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an