Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

arXiv:2603.20562v1 Announce Type: cross Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We