Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking

arXiv:2603.19744v1 Announce Type: new Human Label Variation (HLV), i.e., systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by