DiLA: Disentangled Latent Action World Models

arXiv:2605.15725v1 Announce Type: cross Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage