The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

arXiv:2603.26589v1 Announce Type: new What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis.