“Act-based approval-directed agents”, for IDA skeptics

Alignment Forum

Summary / tl;dr

In the 2010s, Paul Christiano built an extensive body of work on AI alignment; see the “Iterated Amplification” series for a curated overview as of 2018. One foundation of this program was the intuition that it should be possible to build “act-based approval-directed agents” (“approval-directed agents” for short). For example, these AGIs would not lie to their human supervisors, because their supervisors wouldn’t want them to lie, and these AGIs would only do things that their supervisors would want them to do.