The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

arXiv:2605.17113v1 Announce Type: new

Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a fundamental question: when does a language model become committed to deception? We