UAM: A Dual-Stream Perspective on Forgetting in VLA Training

arXiv:2605.15735v1 Announce Type: cross

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax.