Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

arXiv:2604.15808v1 Announce Type: cross Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We