AI RESEARCH

Probing the Limits of the Lie Detector Approach to LLM Deception

arXiv CS.LG

arXiv:2603.10003v1 Announce Type: cross Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior.
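To make the idea of a truth probe concrete, the following is a minimal sketch of a linear probe, assuming hidden activations are available as plain vectors with true/false labels. It is not the paper's method: the data is synthetic, and `DIM`, `make_synthetic_activations`, `train_truth_probe`, and `probe_says_true` are hypothetical names introduced only for illustration.

```python
import math
import random

DIM = 8  # hypothetical hidden-state width, chosen only for illustration


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n_per_class=200, seed=0):
    # Assumption: "true" and "false" statements are shifted in opposite
    # directions along one fixed unit vector. Real model activations
    # need not behave this way.
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(DIM)] * DIM
    data = []
    for label, sign in ((1, 1.0), (0, -1.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) + sign * 2.0 * d for d in direction]
            data.append((x, label))
    return data


def train_truth_probe(data, lr=0.1, steps=200):
    # Logistic regression fitted by full-batch gradient descent.
    w = [0.0] * DIM
    b = 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0] * DIM
        grad_b = 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(DIM):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_says_true(x, w, b):
    # The probe labels an activation "true" when its logit is positive.
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0


data = make_synthetic_activations()
w, b = train_truth_probe(data)
accuracy = sum(probe_says_true(x, w, b) == (y == 1) for x, y in data) / len(data)
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

A probe of this kind only separates activations by truth value. Whether a statement that is true but misleading would be flagged by such a probe is the question the paper investigates; the sketch does not answer it.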