From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

arXiv:2605.04678v1 Announce Type: cross Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions.