IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

arXiv:2605.14712v1 Announce Type: cross Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We.
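The inter-chunk conflict described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's method: the two intents (`"left"`, `"right"`), the helper names `sample_intent`, `rollout`, and `count_switches`, and the "hold the first intent" baseline are all hypothetical and chosen only for illustration. It assumes a fully aliased observation, so a frame-conditioned policy can do no better than draw an intent at random at each replanning step.

```python
import random

# Toy assumption: two equally valid short-horizon intents for the same
# aliased observation (e.g. passing an obstacle on either side).
INTENTS = ("left", "right")


def sample_intent(rng):
    """Toy frame-conditioned policy: the observation is aliased, so the
    intent is drawn at random with no memory of earlier chunks."""
    return rng.choice(INTENTS)


def rollout(num_chunks, rng, hold_intent):
    """Return the intent behind each action chunk in one episode.

    hold_intent=False: resample the intent at every replanning step.
    hold_intent=True:  sample once and reuse it, a crude stand-in for
                       conditioning on recent context (hypothetical baseline).
    """
    intents = []
    for _ in range(num_chunks):
        if hold_intent and intents:
            intents.append(intents[-1])
        else:
            intents.append(sample_intent(rng))
    return intents


def count_switches(intents):
    """Count adjacent chunk pairs whose intents disagree."""
    return sum(a != b for a, b in zip(intents, intents[1:]))


if __name__ == "__main__":
    resampled = rollout(1000, random.Random(0), hold_intent=False)
    held = rollout(1000, random.Random(0), hold_intent=True)
    print("switches when resampling every chunk:", count_switches(resampled))
    print("switches when holding the intent:", count_switches(held))
```

In this toy setting, each switch marks a pair of adjacent chunks that pursue different intents, which is the kind of conflict the abstract attributes to frame-conditioned replanning under partial observability.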