Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

arXiv:2603.17305v1 Announce Type: cross

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space.
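
To make the idea of an objective defined over the hidden state space concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss on hidden-state vectors. It is an illustration under stated assumptions and not CRAFT's actual objective, which the abstract does not specify: the function names, the use of cosine similarity, the `temperature` parameter, and the notion of "safe" and "unsafe" reference vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's formula).

    anchor:    hidden state of the reasoning trace being trained
    positive:  hypothetical hidden state of a safety-aware trace
    negatives: hypothetical hidden states of unsafe (jailbroken) traces
    The loss is small when the anchor is closer to the positive
    than to any of the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-D "hidden states" for illustration only.
safe_ref = [1.0, 0.0]
unsafe_ref = [-1.0, 0.0]

loss_aligned = contrastive_loss([1.0, 0.0], safe_ref, [unsafe_ref])
loss_misaligned = contrastive_loss([-1.0, 0.0], safe_ref, [unsafe_ref])
print(loss_aligned < loss_misaligned)
```

In this toy setting, a hidden state that points toward the safe reference incurs a much lower loss than one that points toward the unsafe reference, which is the kind of signal a representation-level objective could provide in addition to output-level supervision.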