Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos

arXiv:2511.12882v3 Announce Type: replace-cross

Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address this limitation, we propose MTV-World, an embodied world model that