New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped the second time. The median model flips in 45% of decisive case pairs. GPT-5.4 (high) is the worst, at 66%.

Info, including charts, per-case metrics, raw judge outputs, and the parsed answer dump:

This benchmark isolates one basic and frustrating failure mode: a judge's verdict changing when only the presentation order changes. Averaged across models, the first-shown story gets picked 63% of the time. GPT-5.4 (high) is the most position-sensitive model in the run. Many models don't just pick the first story more often; they also rate it higher. The average first-position rating bonus is +0.26 on a 1-7 scale. Mistral Large 3 is the outlier in the opposite direction. Xiaomi MiMo V2 Pro has the lowest flip rate (20%), but with only 55% coverage.
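
For anyone who wants to reproduce this kind of number on their own judge outputs, here is a rough sketch of how the headline metrics could be computed. It is not the benchmark's actual code. The `PairResult` record and the exact definitions are my assumptions: a pair counts as "decisive" when the judge names a winner in both orderings, a "flip" means the two orderings name different stories, and "coverage" is the share of pairs that are decisive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairResult:
    """Hypothetical record for one case pair (not the benchmark's schema).

    Each field holds the story the judge preferred, "A" or "B",
    or None if the verdict was a tie or could not be parsed.
    """
    pick_ab: Optional[str]  # verdict when A was shown first, B second
    pick_ba: Optional[str]  # verdict when B was shown first, A second


def position_bias_metrics(results: list[PairResult]) -> dict[str, float]:
    """Compute flip rate, first-shown pick rate, and coverage.

    Assumed definitions (see the note above the code):
      decisive pair = a winner was named in both orderings
      flip          = the two orderings name different stories
      coverage      = decisive pairs / all pairs
    """
    if not results:
        raise ValueError("no results to score")

    decisive = [
        r for r in results
        if r.pick_ab is not None and r.pick_ba is not None
    ]
    if not decisive:
        raise ValueError("no decisive pairs to score")

    # A consistent judge names the same story regardless of order.
    flips = sum(r.pick_ab != r.pick_ba for r in decisive)

    # In the A-first ordering the first-shown story is A;
    # in the B-first ordering it is B.
    first_picks = sum(r.pick_ab == "A" for r in decisive)
    first_picks += sum(r.pick_ba == "B" for r in decisive)

    return {
        "flip_rate": flips / len(decisive),
        "first_shown_pick_rate": first_picks / (2 * len(decisive)),
        "coverage": len(decisive) / len(results),
    }


if __name__ == "__main__":
    demo = [
        PairResult("A", "B"),   # picked the first-shown story both times
        PairResult("A", "A"),   # consistent
        PairResult("B", "B"),   # consistent
        PairResult(None, "A"),  # not decisive
        PairResult("A", "B"),   # picked the first-shown story both times
    ]
    print(position_bias_metrics(demo))
```

Under these definitions, an unbiased judge would sit near a 50% first-shown pick rate. Low coverage matters when reading the flip rate: a model that rarely commits to a winner is scored on fewer pairs, which is why a 20% flip rate at 55% coverage is hard to compare directly with models that decide almost every case.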