AI RESEARCH
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
arXiv CS.LG
•
ArXi:2604.19117v1 Announce Type: new When a language model agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a "this statement is wrong" signal whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge.