AI RESEARCH
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
arXiv CS.LG
•
ArXi:2605.02914v1 Announce Type: new A guard model fine-tuned on entirely benign data can lose all safety alignment -- not through adversarial manipulation, but through standard domain specialization. We nstrate this failure across three purpose-built safety classifiers -- LlamaGuard, WildGuard, and Granite Guardian -- deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful -- benign representational boundary that guides classification.