Medical Image Spatial Grounding with Semantic Sampling

arXiv:2603.14579v1 Announce Type: cross Vision-language models (VLMs) have shown significant promise in visual grounding for both images and videos. In medical imaging research, VLMs bridge object detection and segmentation on one side and report understanding and generation on the other. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges.