Measuring Safety Alignment Effects in Autonomous Security Agents

arXiv:2605.19722v1 Announce Type: cross Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes.