Test your best methods on our hard CoT interp tasks

Authors: Daria Ivanova, Riya Tyagi, Josh Engels, Neel Nanda

Daria and Riya are co-first authors. This work was done during Neel Nanda’s MATS 9.0. Claude helped write code and suggest edits for this post.

Most of our tasks fall in 3 categories: predicting future actions, detecting the effect of an intervention, and identifying distributional properties of a rollout.

TL;DR

One of our best safety techniques right now is “just read the chain of thought