Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

arXiv:2603.19166v1 Announce Type: cross Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision-language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces.