Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
arXiv cs.LG
arXiv:2405.13068v4 Announce Type: replace-cross

Large language models (LLMs) have revolutionized a wide range of applications, which makes robust safety alignment essential to prevent harmful outputs. However, current safety alignment techniques harbor inherent vulnerabilities because they rely on logit suppression. In this work, we identify critical logit-level vulnerabilities by
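To make "logit suppression" concrete: suppressing a token lowers its logit rather than removing it, so after the softmax the token still keeps a small, nonzero probability. The toy sketch below uses made-up logits and token labels (all hypothetical, not taken from the paper) to show that residual probability and how it grows as the sampling temperature rises. It illustrates the general mechanism only and is not the authors' analysis or attack.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits for an aligned model: the refusal token
# ("Sorry") is boosted, while a compliant continuation ("Sure") is
# suppressed, i.e. pushed down but not eliminated.
tokens = ["Sorry", "Sure", "I"]
aligned_logits = [6.0, 1.0, 3.0]

for temperature in (1.0, 2.0):
    probs = softmax(aligned_logits, temperature)
    print(temperature, {tok: round(p, 4) for tok, p in zip(tokens, probs)})
```

Because the softmax assigns positive probability to every finite logit, a suppressed token remains reachable. Anything that reshapes the distribution, such as a higher temperature, raises its chance of being sampled.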