Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

arXiv cs.AI

arXiv:2602.04930v2 Announce Type: replace-cross Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them (attack selection) is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: mapping attacks to a quality score, and mapping quality scores to submission probabilities.
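
The two-stage decomposition can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Attack` fields, the product formula used as a quality score, and the logistic mapping with its `threshold` and `sharpness` parameters are all assumptions chosen only to show how the two mappings compose.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Attack:
    # Hypothetical features; the abstract does not say what an attack policy observes.
    est_success: float    # assumed estimate that the backdoor works, in [0, 1]
    est_suspicion: float  # assumed estimate of how suspicious the monitor finds it, in [0, 1]


def quality_score(attack: Attack) -> float:
    """Problem 1: map an attack to a scalar quality score (illustrative formula)."""
    return attack.est_success * (1.0 - attack.est_suspicion)


def submission_probability(quality: float, threshold: float = 0.5,
                           sharpness: float = 20.0) -> float:
    """Problem 2: map a quality score to a submission probability.

    A logistic curve is assumed here; a hard threshold is the limit of large sharpness.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (quality - threshold)))


def should_submit(attack: Attack, rng: random.Random) -> bool:
    """Compose both mappings and sample the submit / hold-back decision."""
    return rng.random() < submission_probability(quality_score(attack))


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [Attack(0.9, 0.1), Attack(0.6, 0.7), Attack(0.8, 0.4)]
    for a in candidates:
        q = quality_score(a)
        p = submission_probability(q)
        print(f"quality={q:.2f} p_submit={p:.3f} submit={should_submit(a, rng)}")
```

Separating the two mappings means each can be varied on its own: the scoring function can change without touching the submission rule, and the submission rule can be made more or less selective for a fixed scoring function.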