AI RESEARCH
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
arXiv CS.AI
arXiv:2604.01473v2 Announce Type: replace-cross Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either