Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

arXiv:2605.02757v1

Vision-language-action (VLA) models typically rely on large-scale real-world videos. Simulated data is inexpensive and highly parallelizable to collect, but it often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic