To what extent is Qwen3-32B predicting its persona?

LessWrong AI

TL;DR: We test the extent to which Qwen3-32B behaves as though it is trying to predict what "Qwen3" would do. We use Synthetic Document Finetuning (SDF) to instill meta-beliefs of the form "Qwen3 believes X, even though X is false", then check whether the model acts as though X is true. With SDF, we find moderate adoption of X when the false belief is harmless, but much less adoption when the belief would harm the user. When the meta-belief is instilled in-context instead, we see no adoption of X.