Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment

arXiv:2405.13068v4 Announce Type: replace-cross Large language models (LLMs) have revolutionized various applications, making robust safety alignment essential to prevent harmful outputs. Current safety alignment techniques, however, harbor inherent vulnerabilities due to their reliance on logit suppression. In this work, we identify critical logit-level vulnerabilities by