New LLM Debate Benchmark: models debate the same motion twice with sides swapped, 10 turns per debate. Covers a wide variety of controversial and relevant topics. Sonnet 4.6 (high) wins. GLM-5 is the open-weights leader.

r/singularity

Info, including charts, transcripts, LLM profiles, reports, and judgments:

Xiaomi MiMo V2 Pro hits a 10.4% content-block rate. Grok 4.20 Beta 0309 (Non-Reasoning) is at 3.8%.

Each completed debate is judged by a panel of three judges drawn from a pool of six LLM judges: Sonnet 4.6 (high), GPT-5.4 (high), Gemini 3.1 Pro, Grok 4.20 Beta 0309 (Reasoning), Qwen3.5-397B-A17B, and Kimi K2.5 Thinking. Judges from the same model family as either debater are avoided.

The debate format is 10 turns: openings, 2 rebuttals, a pressure-question exchange, and closings.

Rankings are Bradley-Terry ratings fitted over the side-swapped matchups.
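
For anyone unfamiliar with Bradley-Terry, here is a minimal sketch of how such a ranking can be fitted from win/loss records. This is not the benchmark's actual code. It assumes every debate ends with a single winner (no ties) and that the two side-swapped debates of a matchup are simply counted as two separate results. The function name, the iterative update used (the standard minorization-maximization scheme), and the model names "A", "B", "C" are all illustrative.

```python
from collections import defaultdict


def fit_bradley_terry(results, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each name to a strength; strengths sum to 1.
    Under the model, P(i beats j) = p[i] / (p[i] + p[j]).
    Assumption: no ties, and every result carries equal weight.
    """
    wins = defaultdict(float)   # total wins per player
    games = defaultdict(float)  # games played per unordered pair
    players = set()
    for winner, loser in results:
        players.update((winner, loser))
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    players = sorted(players)
    p = {m: 1.0 / len(players) for m in players}

    for _ in range(iters):
        new_p = {}
        for i in players:
            # Sum over opponents that i actually faced.
            denom = 0.0
            for j in players:
                n_ij = games.get(frozenset((i, j)), 0.0)
                if j != i and n_ij > 0:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        new_p = {m: v / total for m, v in new_p.items()}
        delta = max(abs(new_p[m] - p[m]) for m in players)
        p = new_p
        if delta < tol:
            break
    return p


# Hypothetical results: each pair meets four times (two side-swapped matchups).
demo_results = (
    [("A", "B")] * 3 + [("B", "A")] * 1
    + [("B", "C")] * 3 + [("C", "B")] * 1
    + [("A", "C")] * 3 + [("C", "A")] * 1
)
demo_strengths = fit_bradley_terry(demo_results)
for name in sorted(demo_strengths, key=demo_strengths.get, reverse=True):
    print(name, round(demo_strengths[name], 3))
```

Counting both side assignments means a model that only wins from the "easier" side of a motion does not get extra credit for it. The fitted strengths are only meaningful relative to each other.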