BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

arXiv:2604.10528v1

While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism: they conflate semantic reasoning with texture mapping and rely on imprecise annotations that inadvertently leak environmental cues. To address this gap, we.