Reproducing steering against evaluation awareness in a large open-weight model
LessWrong AI
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: We replicate Anthropic’s approach of using steering vectors to suppress evaluation awareness, testing on GLM-5 with the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors, derived from contrastive pairs that are semantically unrelated to alignment, can have effects just as large as deliberately designed evaluation-awareness steering vectors.
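For readers unfamiliar with the term, here is a minimal sketch of how a steering vector can be built from contrastive pairs. It uses the common difference-of-means recipe, which is an assumption on our part for illustration and not necessarily the exact procedure used in this work or in Anthropic's. The function names, the toy 3-dimensional "activations", and the sign convention (negative `alpha` to push away from the "evaluation" side) are all hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between the two sides of the contrastive pairs."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden, vector, alpha):
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations, made up for illustration only (not real model data).
# In practice these would be residual-stream activations at one layer,
# collected on the "evaluation" and "deployment" halves of each pair.
eval_prompt_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
deploy_prompt_acts = [[0.0, 0.0, 2.0], [2.0, 0.0, 2.0]]

v = steering_vector(eval_prompt_acts, deploy_prompt_acts)

# Assumed convention: a negative alpha steers away from the "evaluation" direction.
steered = apply_steering([0.5, 0.5, 0.5], v, alpha=-2.0)
```

A "control" vector in this picture would be built the same way, but from pairs whose two sides differ in something unrelated to alignment, so the only thing that changes is which prompts feed `steering_vector`.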