Allegory of the Cave: Measurement-Grounded Vision-Language Learning

arXiv:2605.11727v1 Announce Type: new

Vision-language models typically reason over post-ISP RGB images, even though RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface moves closer to the underlying camera measurement.
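
To make the information-loss claim concrete, the following is a minimal sketch, not the paper's pipeline. It assumes a hypothetical toy rendering step (`toy_isp`) that applies an exposure gain, clips to the displayable range, gamma-encodes, and quantizes to 8 bits. The 12-bit white level, the gain, the gamma value, and the sample readings are all illustrative assumptions.

```python
def toy_isp(raw, white_level=4095, exposure_gain=4.0, gamma=2.2):
    """Hypothetical toy rendering: gain, clip to [0, 1], gamma-encode, 8-bit quantize.

    This is an illustrative stand-in for an ISP, not any real camera pipeline.
    """
    rendered = []
    for value in raw:
        linear = value / white_level * exposure_gain   # normalize and apply gain
        clipped = min(max(linear, 0.0), 1.0)           # highlights saturate here
        encoded = clipped ** (1.0 / gamma)             # display gamma encoding
        rendered.append(round(encoded * 255))          # 8-bit quantization
    return rendered


# Five distinct (assumed) 12-bit sensor readings: three bright, two dark.
raw = [1200, 2000, 4000, 100, 101]
rgb = toy_isp(raw)

print("distinct raw values:", len(set(raw)))
print("distinct rendered values:", len(set(rgb)))
```

Under these assumptions the three bright readings all saturate to the same code value after clipping, and the two neighboring dark readings fall into the same 8-bit bin, so the rendered output distinguishes fewer values than the sensor recorded. A model that only sees the rendered values cannot recover those differences.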