ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation

arXiv:2603.05530v1 Announce Type: cross Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a