Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

arXiv:2504.14988v3 Announce Type: replace Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we