Steering Might Stop Working Soon
LessWrong AI
Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning now for it failing. This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partly for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought.