Consistency Training Could Help Limit Sycophancy and Jailbreaks

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

Blog post accompanying the full paper, available on arXiv.

“You’re absolutely right!” Even the smartest models’ factuality or refusal