Steering Might Stop Working Soon

LessWrong AI

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning now for it failing. This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partly for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought.