AI RESEARCH
Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
arXiv cs.LG
arXiv:2604.13386v1. Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by 29% on Insider Trading and 78% on Harm-Pressure Knowledge.
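
To make the ensembling idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the abstract does not say how the per-layer probes are combined, so averaging the probes' logits is an assumption made here for illustration. The activations are synthetic (each layer carries a weak, independently noisy label signal), and every name in the block (fit_logistic_probe, auroc, make_synthetic_activations) is hypothetical.

```python
import numpy as np


def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Fit a linear probe (logistic regression) by plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # gradient of L2-regularized log loss
        b -= lr * np.mean(p - y)
    return w, b


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count as half
    return float(wins / (len(pos) * len(neg)))


def make_synthetic_activations(n, n_layers, dim, signal, rng):
    """Assumed toy data: every layer encodes the label weakly along its own direction."""
    y = rng.integers(0, 2, size=n)
    directions = rng.normal(size=(n_layers, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    acts = rng.normal(size=(n_layers, n, dim))
    acts += signal * (2 * y - 1)[None, :, None] * directions[:, None, :]
    return acts, y


rng = np.random.default_rng(0)
acts, y = make_synthetic_activations(n=2000, n_layers=6, dim=32, signal=0.5, rng=rng)
train, test = slice(0, 1000), slice(1000, 2000)

# One probe per layer, trained on that layer's activations only.
probes = [fit_logistic_probe(acts[l, train], y[train]) for l in range(acts.shape[0])]

# Held-out logits of every probe, shape (n_layers, n_test).
layer_logits = np.stack([acts[l, test] @ w + b for l, (w, b) in enumerate(probes)])

single_aurocs = [auroc(layer_logits[l], y[test]) for l in range(len(probes))]
# Assumed combination rule: unweighted mean of the per-layer logits.
ensemble_auroc = auroc(layer_logits.mean(axis=0), y[test])

print("single-layer AUROCs:", [round(a, 3) for a in single_aurocs])
print("ensemble AUROC:", round(ensemble_auroc, 3))
```

In this toy setup the layers carry independent noise, so averaging their logits tends to cancel noise while preserving the shared signal. Real activations are correlated across layers, so the gain there may be smaller, and a learned combination (for example, a second-stage classifier over the per-layer logits) is another plausible design.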