We found an open weight model that games alignment honeypots

LessWrong AI

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.

TL;DR: GLM-5 (released a month ago, in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, while we have not found similar behaviour in earlier open-weight models (see Figure 1). This was the result of a preliminary investigation looking for open-weight models we can use for research into evaluation gaming on alignment honeypots.