LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

arXiv:2604.11689v1 Announce Type: new While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in using large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We