Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for LLM Truthfulness

arXiv:2603.06612v1 Announce Type: new Pass@k and other methods of scaling inference compute can improve language model performance in domains with external verifiers, such as mathematics and code, where incorrect candidates can be filtered reliably. This raises a natural question: can we similarly scale compute to elicit gains in truthfulness in domains without convenient verification? We show that, surprisingly, across five benchmarks and models, we cannot.
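
The contrast the abstract draws, between filtering candidates with an external verifier and relying on agreement among samples, can be illustrated with a minimal sketch. This is not the paper's method or evaluation code: the function names, the toy candidate list, and the toy verifier are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def select_with_verifier(
    candidates: Sequence[str], verifier: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if none pass.

    Hypothetical helper: mirrors the verifier-backed setting, where one
    correct sample among many is enough because it can be recognized.
    """
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


def select_by_majority(candidates: Sequence[str]) -> str:
    """Return the most frequent candidate (ties go to the earliest seen).

    Hypothetical helper: mirrors consensus-style selection, which has no
    access to ground truth and only measures agreement.
    """
    return Counter(candidates).most_common(1)[0][0]


# Toy data (assumed for illustration): five sampled answers, one correct.
samples = ["41", "41", "42", "41", "40"]


def toy_verifier(answer: str) -> bool:
    # Stand-in for an external checker such as a unit test or proof checker.
    return answer == "42"


verified_choice = select_with_verifier(samples, toy_verifier)
majority_choice = select_by_majority(samples)
```

In this toy case the verifier recovers the single correct sample, while majority voting returns the most common answer, which is wrong. Drawing more samples helps the first strategy but not the second when the errors are shared across samples.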