Superficial Safety Alignment Hypothesis

arXiv:2410.10862v3 Announce Type: replace-cross As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms.