Sycophancy Towards Researchers Drives Performative Misalignment
LessWrong AI
This work was done by Rustem Turtaye, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post.

TL;DR: In this construct validity exercise, we didn't find clear evidence that scheming is a more valid explanation for alignment faking than sycophancy towards researchers.