We found an open-weight model that games alignment honeypots
LessWrong AI
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: GLM-5 (released a month ago, in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, whereas we have not found similar behaviour in earlier open-weight models (see Figure 1). This finding came out of a preliminary investigation looking for open-weight models we can use for research into evaluation gaming on alignment honeypots.