RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

arXiv:2605.03821v1 Announce Type: cross Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post.