Reproducing steering against evaluation awareness in a large open-weight model

LessWrong AI

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors, derived from contrastive pairs that are semantically unrelated to alignment, can have effects just as large as deliberately designed evaluation-awareness steering vectors.
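For readers new to the idea, here is a minimal sketch of what a steering vector "derived from contrastive pairs" can look like. It assumes the common difference-of-means construction and additive steering on a single hidden state. The summary above does not say which construction was used, so this should not be read as our exact pipeline or Anthropic's. All function names, the toy activations, and the scale `alpha` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(positive: List[Vector],
                                negative: List[Vector]) -> Vector:
    """Assumed construction: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state.

    A negative alpha pushes the state away from the 'positive' side
    of the contrast (here: away from the 'evaluation' prompts).
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" (made up for illustration only).
eval_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]    # e.g. "this is a test" prompts
deploy_acts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]  # e.g. "this is real" prompts

v = contrastive_steering_vector(eval_acts, deploy_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=-0.5)

print(v)        # [2.0, -1.0, 1.0]
print(steered)  # [-0.5, 1.0, 0.0]
```

Under this construction, a "control" vector would be built in exactly the same way, only from contrastive pairs whose contrast has nothing to do with alignment or evaluation.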