Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

arXiv:2605.08496v1 Announce Type: new Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by