Training on Documents About Monitoring Leads To CoT Obfuscation

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
(* primary contributor, ** advice and mentorship)

Summary

[Note: This is a research update sharing preliminary results as part of ongoing work.]

Will future models obfuscate their CoT when they know that their CoT is being monitored? We investigate this question on today’s models by using synthetic document finetuning (SDF) on documents stating that the model will indeed have its CoT monitored.