Benchmarking Deflection and Hallucination in Large Vision-Language Models

arXiv:2604.12033v1

Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence, and they neglect the importance of generating deflections (e.g., "Sorry, I cannot answer.") when the retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM