Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

arXiv:2603.16600v1 Announce Type: new Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited