\textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation

arXiv:2601.18188v2

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings and do not explicitly model action-grounded visual dynamics. Without awareness of how actions transform subsequent visual observations, agents cannot plan actions rationally, which leads to unstable behavior, weak generalization, and cumulative error along the trajectory. To address these issues, we