RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

arXiv:2603.21341v1 Announce Type: new Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through visual question answering (VQA)-style supervision. However, these approaches have been reported to yield unstable VLA performance, often with only marginal or even negative gains. In this paper, we propose a systematic MLLM