Incriminating misaligned AI models via distillation
LessWrong AI
Suppose we have a dangerous misaligned AI that can fool alignment audits, and we distill it into a student model. Two things can happen: …