The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

arXiv:2605.08427v1

Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, with the attacker trying to jailbreak the defender. If self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it.