
From personas to intentions: towards a science of motivations for AI models

LessWrong AI

TLDR: Behavior-only descriptions are useful but insufficient for aligning advanced models with high assurance. Two models can look equally aligned on ordinary prompts while being driven by very different underlying motivations, and the difference may show up only in rare but crucial situations. Persona research should therefore aim to infer motivational structure: the latent drives, values, and priority relations that generate context-specific intentions and behavior.