AI RESEARCH
Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
arXiv cs.LG
arXiv:2602.08877v2 Announce Type: replace
Alignment audits aim to robustly identify the hidden goals of strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap by implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods.