Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

arXiv:2605.19728v1 Announce Type: new Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can scalable