Are there any benchmarks or leaderboards for image description with LLMs?
r/LocalLLaMA
Hi everyone, I’m looking for benchmarks or leaderboards specifically focused on image description / image captioning quality with LLMs or VLMs. Most of the benchmarks I find are about general multimodal reasoning, VQA, OCR, or broad vision-language performance, but what I really want is something that evaluates how well models describe an image in natural language.