AI RESEARCH

DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

arXiv cs.LG

arXiv:2603.08145v1. Announce Type: new. Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic disagreement among annotators and user groups makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a re