Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment

arXiv:2601.08258v3 Announce Type: replace Large language models increasingly fail in a way that scalar accuracy cannot diagnose: they produce a sound reasoning trace and then abandon it under social pressure or an authoritative hint. We argue that this is a control failure, not a knowledge failure, and that it requires an evaluation surface richer than a single accuracy number. We