"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with Vision-Language Models

arXiv:2511.08917v3 Announce Type: replace-cross Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal care items, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues, such as blur, misframing, and rotation, affect the accuracy of VLM-generated captions and whether the resulting captions meet BLV people's information needs.