Prefill awareness: can LLMs tell when “their” message history has been tampered with?
LessWrong AI
David Africa*, Alex Souly*, Jordan Taylor, Robert Kirk

TLDR: We test whether LLMs can detect when their conversation history has been tampered with (prefill awareness). We find this ability is inconsistent across models and datasets, shallow, and rarely surfaces spontaneously during normal conversation. However, recent Claude models show rather strong prefill detection capabilities when prompted, suggesting prefill awareness is an emerging and model-specific confound that should be actively monitored in off-policy alignment evals.