AI SAFETY & ETHICS

Will reward-seekers respond to distant incentives?

Alignment Forum

Reward-seekers are usually modeled as responding only to local incentives administered by developers. Here I ask: Will AIs or humans be able to influence their incentives at a distance - e.g., by retroactively reinforcing actions substantially in the future or by committing to run many copies of them in simulated deployments with different incentives? If reward-seekers are responsive to distant incentives, it fundamentally changes the threat model, and is probably bad news for developers on balance.