WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

arXiv:2604.05818v1 Announce Type: cross

Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advances, prevailing methods still depend primarily on images as the retrieval key and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to fully leverage their potential. In this paper, we