The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

ArXi:2505.24840v2 Announce Type: replace-cross This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets.