The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
Alignment Forum
1) The safe-to-dangerous shift is a fundamental problem for eval realism

Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably distinguish the deployment distribution from the evaluation distribution, as it is otherwise difficult to rule out the possibility of alignment faking.
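One way to make "reliably distinguish" concrete is to treat it as a classification problem and report an AUROC. The sketch below is only an illustration under stated assumptions: it assumes we can somehow elicit, for each prompt, a score for how strongly the model believes it is being evaluated, and the function name `auroc` and all scores are hypothetical, not taken from any real experiment. An AUROC near 0.5 would mean the model's guesses are no better than chance, while a value near 1.0 would mean the two distributions are easy to tell apart.

```python
def auroc(eval_scores, deploy_scores):
    """Probability that a randomly chosen evaluation prompt gets a higher
    "this is an eval" score than a randomly chosen deployment prompt.
    Ties count as half a win."""
    wins = 0.0
    for e in eval_scores:
        for d in deploy_scores:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(eval_scores) * len(deploy_scores))


# Hypothetical data: the model's stated probability that each prompt
# comes from an evaluation rather than from deployment.
eval_scores = [0.9, 0.8, 0.7, 0.4]
deploy_scores = [0.6, 0.3, 0.2, 0.1]

print(f"AUROC = {auroc(eval_scores, deploy_scores):.3f}")
```

With these made-up scores, 15 of the 16 eval/deployment pairs are ranked correctly, which is exactly the kind of result that would undercut the reassurance a black-box evaluation provides.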