AI RESEARCH

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

arXiv CS.AI

ArXi:2603.10068v1 Announce Type: cross Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events.