When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

arXiv:2601.04609v2 Announce Type: replace Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, the specificity of a description is often conflated with its length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is specific to the extent that it picks out the target image better than other possible images.
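
The contrast-set definition can be illustrated with a small sketch. The abstract does not give a formula, so everything here is an assumption made for illustration: the function name `specificity`, the rank-based scoring rule, and the numeric match scores, which are imagined to come from some image-text similarity model. The sketch scores a description by the fraction of contrast images it matches less well than the target, counting ties as half.

```python
from typing import Sequence


def specificity(target_score: float, contrast_scores: Sequence[float]) -> float:
    """Hypothetical rank-based specificity of one description.

    target_score: how well the description matches the target image.
    contrast_scores: how well the same description matches each image
        in the contrast set.
    Returns a value in [0, 1]; 1.0 means the description matches the
    target better than every contrast image.
    """
    if not contrast_scores:
        raise ValueError("contrast set must be non-empty")
    # Contrast images the description fits strictly worse than the target.
    beaten = sum(1.0 for s in contrast_scores if s < target_score)
    # Ties count as half, so an uninformative description lands near 0.5.
    ties = sum(0.5 for s in contrast_scores if s == target_score)
    return (beaten + ties) / len(contrast_scores)


# Made-up scores: a short description that fits only the target image.
short_but_specific = specificity(0.9, [0.2, 0.3, 0.1])

# Made-up scores: a long description that fits the contrast images
# about as well as it fits the target.
long_but_vague = specificity(0.6, [0.6, 0.7, 0.5])

print(short_but_specific, long_but_vague)
```

Word count never enters the computation, which is the point of the definition: under this sketch, a description's score depends only on how it separates the target from the contrast set.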