AI RESEARCH
Manifold of Failure: Behavioral Attraction Basins in Language Models
arXiv CS.AI
arXiv:2602.22291v3 Announce Type: replace-cross While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper