How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny

Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post.

When is a “deceptively aligned” policy capable of surviving