Incriminating misaligned AI models via distillation

LessWrong AI

Suppose we have a dangerous misaligned AI that can fool alignment audits, and we distill it into a student model. Two things can happen: …