CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

arXiv:2507.15698v2 Announce Type: replace-cross

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference.