Seamless Deception: Larger Language Models Are Better Knowledge Concealers

arXiv:2603.14672v1 Announce Type: cross Language Models (LMs) may acquire harmful knowledge and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when an LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based concealment.