Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

arXiv:2601.03027v3

Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We