Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

ArXi:2603.20212v1 Announce Type: cross Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8.