Benchmarking Misuse Mitigation Against Covert Adversaries

ArXi:2506.06414v2 Announce Type: replace-cross Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because the individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks.