Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

arXiv:2509.21344v2 Announce Type: replace-cross White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from $10$ to $30$ points, depending on the setting.