VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

arXiv:2512.02902v2 Announce Type: replace-cross

Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates.
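
The abstract does not specify the form of the "lightweight, learnable updates", so the following is only an illustrative sketch and not the paper's method. It assumes a frozen visual encoder whose features under a new viewpoint differ from the training-view features by a per-channel scale and offset, and it fits that correction by gradient descent on the frames of a single demonstration. The function name `fit_affine_recalibration`, the synthetic features, and the affine form itself are all hypothetical.

```python
import numpy as np

def fit_affine_recalibration(shifted_feats, reference_feats, steps=500, lr=0.1):
    """Fit a per-channel scale and shift mapping shifted features onto reference ones.

    Hypothetical stand-in for a lightweight recalibration layer; the encoder
    that produced the features is treated as frozen and is not modeled here.
    """
    dim = shifted_feats.shape[1]
    scale = np.ones(dim)   # start from the identity mapping
    shift = np.zeros(dim)
    for _ in range(steps):
        err = shifted_feats * scale + shift - reference_feats
        # Gradients of the mean squared error with respect to scale and shift.
        scale -= lr * 2.0 * np.mean(err * shifted_feats, axis=0)
        shift -= lr * 2.0 * np.mean(err, axis=0)
    return scale, shift

# Synthetic "one demonstration": 64 frames of 8-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 8))   # features under the training viewpoint
shifted = (reference - 0.5) / 1.5      # same frames under a simulated new viewpoint

scale, shift = fit_affine_recalibration(shifted, reference)
recalibrated = shifted * scale + shift
```

Only `2 * dim` parameters are trained here, which is the sense in which such an update is lightweight; a real viewpoint change would rarely be exactly affine in feature space, so this toy setup overstates how cleanly the correction can be recovered.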