AI RESEARCH
AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations
arXiv cs.LG
arXiv:2601.07473v4 Announce Type: replace
As models grow capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We