Being-H0.7: A Latent World-Action Model from Egocentric Videos

arXiv:2605.00078v1 Announce Type: cross Vision-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models