A model that evaluates intermediate reasoning steps rather than only final answers.
A Process Reward Model (PRM) is a reward model, used in reinforcement learning from human feedback (RLHF) and reasoning-focused AI systems, that assigns scores to individual steps in a chain of thought or problem-solving process rather than evaluating only the final output. This contrasts with Outcome Reward Models (ORMs), which provide a single reward signal based on whether the final answer is correct. By scoring each intermediate step, a PRM offers a more granular supervisory signal that can distinguish a lucky correct answer reached through flawed reasoning from a correct answer produced by sound logic.
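The difference in interfaces can be made concrete. The sketch below is illustrative only: `score_step`, `orm_reward`, and `prm_rewards` are hypothetical names, and the constant score stands in for a trained neural scorer.

```python
from typing import List

def score_step(problem: str, steps_so_far: List[str]) -> float:
    """Hypothetical stand-in for a trained PRM scorer, which would
    estimate P(latest step is correct | problem, preceding steps)."""
    return 0.9  # dummy constant for illustration only

def orm_reward(problem: str, final_answer_is_correct: bool) -> float:
    """Outcome Reward Model: one sparse scalar for the whole solution."""
    return 1.0 if final_answer_is_correct else 0.0

def prm_rewards(problem: str, steps: List[str]) -> List[float]:
    """Process Reward Model: a dense score for every intermediate step,
    each conditioned on the steps that precede it."""
    return [score_step(problem, steps[: i + 1]) for i in range(len(steps))]
```

The ORM returns one number per solution; the PRM returns a vector with one entry per step, which is what makes step-level selection and training possible.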
In practice, PRMs are trained on datasets where human annotators—or automated verifiers—label each step in a multi-step solution as correct, incorrect, or uncertain. During inference or training of a language model, the PRM evaluates candidate reasoning chains step by step, enabling techniques like best-of-N sampling or beam search to select the most reliable reasoning path rather than simply the most fluent one. This makes PRMs especially valuable in mathematical problem-solving, code generation, and scientific reasoning, where intermediate correctness is as important as the final result.
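As a sketch of PRM-guided best-of-N selection (the function names are illustrative, and the product aggregation used here is one common choice, with the minimum step score being another):

```python
from typing import Callable, List

Chain = List[str]  # a candidate solution, represented as a list of steps

def best_of_n(candidates: List[Chain],
              step_scorer: Callable[[Chain], List[float]]) -> Chain:
    """Return the candidate whose aggregated step scores are highest.

    Multiplying step scores (used here) and taking their minimum are
    both common aggregations; each penalizes chains containing even
    one step the PRM considers likely wrong."""
    def aggregate(chain: Chain) -> float:
        total = 1.0
        for score in step_scorer(chain):
            total *= score
        return total
    return max(candidates, key=aggregate)

# Usage with a dummy scorer standing in for a trained PRM:
dummy_prm = lambda chain: [0.9] * len(chain)
print(best_of_n([["a", "b"], ["c"]], dummy_prm))  # -> ['c'] (fewer factors)
```

Note that under the product rule longer chains accumulate more sub-1.0 factors, a length bias that is one practical reason some systems aggregate by minimum instead.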
The significance of PRMs lies in their ability to provide dense, informative feedback in domains where sparse outcome-based rewards are insufficient for reliable learning. When a model only receives a signal at the end of a long reasoning chain, credit assignment becomes difficult—it is hard to identify which steps contributed to success or failure. PRMs address this by localizing feedback, making training more efficient and the resulting models more trustworthy. They also improve interpretability, since a well-calibrated PRM can flag exactly where a model's reasoning goes wrong.
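A well-calibrated PRM localizes failures mechanically. A minimal sketch, assuming step scores are probabilities of correctness and `first_suspect_step` is a hypothetical helper with an illustrative cutoff:

```python
from typing import List, Optional

def first_suspect_step(step_scores: List[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Index of the first step the PRM scores below `threshold`,
    or None if every step looks sound. The 0.5 cutoff is an
    illustrative choice, not a standard value."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

# The PRM localizes the failure: step 2 is where the reasoning breaks.
print(first_suspect_step([0.95, 0.91, 0.12, 0.40]))  # -> 2
```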
Process Reward Models gained prominence following OpenAI's 2023 "Let's Verify Step by Step" research on competition mathematics problems, which demonstrated that a PRM substantially outperformed an outcome-based model at selecting correct solutions from large sets of sampled reasoning chains. Since then, PRMs have become a central component in efforts to build more reliable and verifiable AI reasoning systems, particularly as large language models are increasingly deployed in high-stakes domains requiring multi-step logical inference.