ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

arXiv:2604.07484v1 Announce Type: cross Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-