GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

arXiv:2604.15495v1 Announce Type: new

Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically rich spaces, they still struggle with spatial grounding in cluttered environments.