Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

arXiv:2603.05494v2 Announce Type: replace-cross Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally occurring dishonesty.