Negation Neglect: When models fail to learn negations in training
LessWrong AI
This is a short summary of our new paper: arXiv, X thread, code.

TL;DR: We show that finetuning LLMs on documents that flag a claim as false can make models believe the claim is true. This is a general phenomenon: it also occurs with other epistemic qualifiers (e.g., stating that a claim has a 3% probability of being true) and extends to model behaviors (e.g., warning against types of misalignment). The effect occurs in all models tested.

Authors: Harry Mayne*, Lev McKinney*, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans (* equal contribution).

Figure: Negation Neglect in our main experiment.