JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

arXiv:2604.00909v1 Announce Type: new Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading