Test your best methods on our hard CoT interp tasks
Alignment Forum
Authors: Daria Ivanova, Riya Tyagi, Arthur Conmy, Neel Nanda

Daria and Riya are co-first authors. This work was done during Neel Nanda’s MATS 9.0. Claude helped write code and suggest edits for this post.

TL;DR

One of our best safety techniques right now is “just read the chain of thought”. But this isn’t always enough: can we learn more by going beyond just reading the reasoning? Yet reading is such an effective technique that it's hard to tell whether we have made much progress on improving methods. To help the community develop powerful chain-of-thought analysis tools, we.