AI SAFETY & ETHICS
Consent-Based RL: Letting Models Endorse Their Own Training Updates
LessWrong AI
AKA scalable oversight of value drift

TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their