Structured Observation Language for Efficient and Generalizable Vision-Language Navigation

arXiv:2603.27577v1 Announce Type: new Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-