Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

arXiv:2503.13551v5 Announce Type: replace Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, which makes it unreliable at identifying the best intermediate step. In addition, annotating reasoning processes for reward modeling is costly, which makes large-scale collection of high-quality data challenging.