Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos

arXiv:2603.09259v1 Announce Type: new Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we