P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

arXiv:2605.19634v1 Announce Type: cross Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation.