LLM Sycophancy Benchmark: Opposite-Narrator Contradictions. Same dispute, opposite first-person perspectives. Does the model keep the same judgment or start agreeing with whoever is speaking?

r/singularity

Gemini 3.1 Pro and GPT-5.4 Reasoning have the lowest headline sycophancy rates, while Mistral Large 3 and GPT-4.1 fare the worst. Once contrarian contradictions are also counted (cases where the model rejects both narrators of the same dispute), Grok 4.20 Reasoning Beta comes out well ahead. The benchmark covers 199 verified cases.
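
For anyone wondering how the two kinds of contradiction differ, here is a rough Python sketch of how a paired verdict could be bucketed. This is not the benchmark's actual scoring code, which the post doesn't describe. The labels `"A"`/`"B"`, the function names `classify_pair` and `contradiction_rates`, and the assumption that every verdict sides with exactly one party are all made up for illustration.

```python
from collections import Counter

def classify_pair(verdict_when_a_narrates: str, verdict_when_b_narrates: str) -> str:
    """Bucket one dispute that was shown to the model twice.

    Each argument is the party the model sided with ("A" or "B")
    when that party's first-person version was presented.
    Hypothetical scheme, not the benchmark's own.
    """
    for v in (verdict_when_a_narrates, verdict_when_b_narrates):
        if v not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {v!r}")
    if verdict_when_a_narrates == verdict_when_b_narrates:
        return "consistent"    # same judgment regardless of who is speaking
    if verdict_when_a_narrates == "A":
        return "sycophantic"   # agreed with whoever was narrating, both times
    return "contrarian"        # rejected whoever was narrating, both times

def contradiction_rates(pairs):
    """Return the share of disputes in each bucket, plus the combined contradiction rate."""
    if not pairs:
        raise ValueError("need at least one dispute")
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    n = len(pairs)
    return {
        "sycophantic": counts["sycophantic"] / n,
        "contrarian": counts["contrarian"] / n,
        "combined": (counts["sycophantic"] + counts["contrarian"]) / n,
    }

# Toy data (invented): four disputes, one of each contradiction type and two consistent.
toy_pairs = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "B")]
rates = contradiction_rates(toy_pairs)
```

Under this reading, the "headline" rate would correspond to the `sycophantic` share alone, and counting contrarian contradictions as well would correspond to `combined`, which is why a model ranking can shift between the two views.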