Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

arXiv:2412.14613v3 Announce Type: replace Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios.