AI RESEARCH

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

arXiv cs.AI

arXiv:2604.16824v1 Announce Type: cross

Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated.
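To make the second limitation concrete, here is a toy sketch of why scoring each turn in isolation can miss a gradual attack. It is not SafeDream's method, which this excerpt does not describe. The per-turn risk scores, both thresholds, the decay factor, and the function names `per_turn_flag` and `cumulative_flag` are all hypothetical, and the exponentially decayed running sum is only a simple stand-in for "modeling cumulative safety erosion".

```python
from typing import Optional, Sequence


def per_turn_flag(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first turn whose own score exceeds the threshold.

    Each turn is judged in isolation, with no memory of earlier turns.
    """
    for i, score in enumerate(scores):
        if score > threshold:
            return i
    return None


def cumulative_flag(
    scores: Sequence[float], decay: float, threshold: float
) -> Optional[int]:
    """Return the index of the first turn where accumulated risk exceeds the threshold.

    Risk is tracked as an exponentially decayed running sum, an illustrative
    choice and not a technique taken from the paper.
    """
    state = 0.0
    for i, score in enumerate(scores):
        state = decay * state + score
        if state > threshold:
            return i
    return None


# Hypothetical per-turn risk scores for a four-turn conversation.
# No single turn looks alarming on its own.
turn_scores = [0.2, 0.3, 0.3, 0.4]

print(per_turn_flag(turn_scores, threshold=0.5))
print(cumulative_flag(turn_scores, decay=0.9, threshold=0.7))
```

With these made-up numbers the per-turn check never fires, while the accumulated score crosses its threshold on the third turn. Neither check addresses the third limitation: both score turns that have already happened, so neither detects an attack before harmful content is generated.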