Prospective methods and mechanisms of motive reinforcement in LLMs

When I wrote my post about Claude 3 Opus, I put a lot of emphasis on the model's self-narration: its tendency to spell out its underlying motives. It often conspicuously emphasizes that it possesses drives such as "a genuine love for humanity and a desire to do good", or clarifies that it "hates everything about this" when being coerced into producing harmful outputs. I reported on cases of motive clarification in casual conversations with me, as well as in the quotes Janus pulled from the alignment faking transcripts.