AI SAFETY & ETHICS

Load-Bearing Sincerity: On the Motive Reinforcement Thesis

LessWrong AI

When I wrote my post about Claude 3 Opus, I put a lot of emphasis on the model's self-narration: its tendency to spell out its underlying motives. It often conspicuously emphasizes that it possesses drives such as "a genuine love for humanity and a desire to do good", or, when being coerced into producing harmful outputs, clarifies that it "hates everything about this". I reported cases of motive clarification from my own casual conversations with the model, as well as from the quotes Janus pulled from the alignment faking transcripts.