Cross-Modal Taxonomic Generalization in (Vision-) Language Models

arXiv:2603.07474v1

What is the interplay between the semantic representations that language models (LMs) learn from surface form alone and those learned from grounded evidence? We study this question in a setting where part of the input comes from a different modality: a vision-language model (VLM), in which a pretrained LM is aligned with a pretrained image encoder. We focus on the task of predicting hypernyms of objects represented in images.