Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added

The benchmark uses adversarial, multi-turn debates across 683 curated motions. Each model pair debates the same motion twice with sides swapped. Scores are Bradley-Terry ratings over side-swapped matchups, reported on an Elo-like scale centered around 1500 for the comparison pool. The benchmark also tracks a judge-side entertainment diagnostic as a secondary signal. Each completed debate is intended to be judged by a three-model panel. Mean cross-judge winner agreement on overlapping side-swapped matchups is 0.55.
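For anyone unfamiliar with how Bradley-Terry ratings end up on an Elo-like scale, here is a minimal sketch. It is not the benchmark's actual code. It assumes each matchup reduces to a single (winner, loser) pair, fits the strengths with the standard minorization-maximization update, and maps them with 400 * log10(strength) so the pool averages 1500. The function name, the 400-point scale, and the toy data are all assumptions for illustration.

```python
import math
from collections import defaultdict


def bradley_terry_ratings(results, iters=500, tol=1e-9, center=1500.0, scale=400.0):
    """Fit Bradley-Terry strengths from (winner, loser) pairs and map them
    to an Elo-like scale.

    Assumption: every model has at least one win and one loss; otherwise
    the maximum-likelihood estimate diverges and this sketch will fail.
    """
    wins = defaultdict(int)
    pair_games = defaultdict(int)
    models = set()
    for winner, loser in results:
        wins[winner] += 1
        pair_games[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    models = sorted(models)

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pair_games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in pair_games
            )
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so pin the
        # geometric mean to 1. This makes the mean rating equal `center`.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        updated = {m: s / math.exp(log_mean) for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return {m: center + scale * math.log10(s) for m, s in strength.items()}


if __name__ == "__main__":
    # Hypothetical outcomes for three placeholder models, not benchmark data.
    toy = (
        [("A", "B")] * 3 + [("B", "A")]
        + [("A", "C")] * 2 + [("C", "A")] * 2
        + [("B", "C")] + [("C", "B")] * 3
    )
    for model, rating in sorted(bradley_terry_ratings(toy).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.0f}")
```

Under this mapping, a rating gap of 400 points corresponds to 10:1 odds that the higher-rated model wins a matchup. How the real benchmark aggregates three-judge panels and side-swapped pairs into wins is not specified here, so treat the input format as a placeholder.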