Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

arXiv:2603.14355v1 Announce Type: new Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution.