AI RESEARCH
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
arXiv CS.AI
•
ArXi:2604.05279v1 Announce Type: new Large language models exhibit sycophancy, the tendency to shift their stated positions toward perceived user preferences or authority cues regardless of evidence. Standard alignment methods fail to correct this because scalar reward models conflate two distinct failure modes into a single signal: pressure capitulation, where the model changes a correct answer under social pressure, and evidence blindness, where the model ignores the provided context entirely.