AI SAFETY & ETHICS

Consent-Based RL: Letting Models Endorse Their Own Training Updates

LessWrong AI

AKA scalable oversight of value drift

TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their