AI RESEARCH
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
arXiv cs.CV
arXiv:2604.11689v1 Announce Type: new While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We