AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

ArXi:2601.07473v4 Announce Type: replace As models grow capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We