AI RESEARCH
Why Safety Probes Catch Liars But Miss Fanatics
arXiv CS.AI
•
ArXi:2603.25861v1 Announce Type: cross Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment - models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like triggers.