Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining

arXiv:2604.16391v1 Announce Type: cross

Vision-language-action (VLA) models have shown great potential for building generalist robots, but they still face a dilemma: the misalignment between 2D image forecasting and 3D action prediction. Besides, such a vision-action entangled