From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

arXiv:2605.12167v1 Announce Type: cross Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions; both suffer from a mismatch between visual realism and control relevance.