ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

arXiv:2604.18789v1

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it [...] We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the reward model (RM) simultaneously.
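To make the compositional idea concrete, here is a minimal sketch of building prompt specifications from the four component types named in the abstract (topics, personas, tactics, goals). Everything in it is an assumption made for illustration: the names `PromptSpec`, `COMPONENTS`, `enumerate_specs`, and `sample_specs`, the placeholder component values, the rendering template, and the uniform random sampling, which stands in for the Safety Mentor's dynamic composition. The abstract does not specify any of these, and the paired malicious and safe responses are not modeled here.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """One combination of the four component types (hypothetical structure)."""
    topic: str
    persona: str
    tactic: str
    goal: str

    def render(self) -> str:
        # Assumed template; the source does not describe how components are merged.
        return (
            f"As {self.persona}, using {self.tactic}, "
            f"ask about {self.topic} in order to {self.goal}."
        )


# Placeholder values only; a real component library is not given in the source.
COMPONENTS = {
    "topic": ["<topic-A>", "<topic-B>"],
    "persona": ["<persona-A>", "<persona-B>"],
    "tactic": ["<tactic-A>", "<tactic-B>"],
    "goal": ["<goal-A>"],
}

KEYS = ("topic", "persona", "tactic", "goal")


def enumerate_specs(components):
    """Yield every combination of one value per component type."""
    for combo in itertools.product(*(components[k] for k in KEYS)):
        yield PromptSpec(*combo)


def sample_specs(components, n, seed=0):
    """Draw up to n distinct combinations uniformly at random.

    Uniform sampling is an assumption standing in for adaptive selection.
    """
    rng = random.Random(seed)
    specs = list(enumerate_specs(components))
    return rng.sample(specs, min(n, len(specs)))


if __name__ == "__main__":
    for spec in sample_specs(COMPONENTS, 3):
        print(spec.render())
```

An adaptive variant would replace the uniform draw in `sample_specs` with a choice informed by which combinations previously exposed a weakness, which is closer in spirit to what the abstract calls dynamic composition.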