Are there any benchmarks or leaderboards for image description with LLMs?

r/LocalLLaMA

Hi everyone, I’m looking for benchmarks or leaderboards specifically focused on image description / image captioning quality with LLMs or VLMs. Most of the benchmarks I find are about general multimodal reasoning, VQA, OCR, or broad vision-language performance, but what I really want is something that evaluates how well models describe an image in natural language.