Training models using signals from intermediate reasoning steps, not just final outputs.
Process-based supervision is a training paradigm in which learning signals are derived from the intermediate steps, reasoning chains, or action trajectories that produce an output, rather than relying exclusively on whether the final answer is correct. This contrasts with outcome-based supervision, where a model receives feedback only on its end result. By targeting the process itself, this approach aims to shape how a model arrives at answers, not merely what answers it produces.
In practice, process-based supervision takes many forms depending on the domain. In imitation learning and robotics, it means training agents to match expert action sequences at each timestep rather than just reaching a goal state. In large language model (LLM) training, it involves annotating intermediate reasoning steps—such as chain-of-thought traces or mathematical proof steps—and rewarding models for producing correct intermediate logic, not just correct final answers. Process reward models (PRMs) are a concrete instantiation: separate models trained to score the quality of each reasoning step, providing dense feedback signals throughout a solution rather than a single terminal reward. In neural program synthesis, it can mean supervising against program execution traces.
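To make the PRM idea concrete, the sketch below trains a tiny step scorer in PyTorch. It assumes each reasoning step has already been encoded into a fixed-size embedding (for example, by a frozen sentence encoder); the `TinyPRM` class, `EMBED_DIM`, and the toy labels are illustrative stand-ins under those assumptions, not any particular published implementation.

```python
import torch
import torch.nn as nn

# Assumption: reasoning steps arrive pre-embedded as fixed-size vectors
# (e.g., from a frozen sentence encoder). EMBED_DIM is illustrative.
EMBED_DIM = 64

class TinyPRM(nn.Module):
    """Scores a single reasoning step: logit of P(step is valid)."""
    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, step_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(step_embedding).squeeze(-1)

prm = TinyPRM()
opt = torch.optim.Adam(prm.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Toy batch: 8 step embeddings, each with a per-step validity label
# (1 = valid step, 0 = flawed step), as a process annotator would provide.
steps = torch.randn(8, EMBED_DIM)
labels = torch.tensor([1., 1., 0., 1., 0., 1., 1., 0.])

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(prm(steps), labels)  # dense, per-step learning signal
    loss.backward()
    opt.step()
```

At inference time, per-step scores from a trained PRM are commonly aggregated, for instance by taking the product or the minimum over steps, to rank candidate solutions in best-of-n sampling.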
The core motivation is improved credit assignment and generalization. When only final outputs are supervised, models can learn shortcuts or spurious correlations that happen to produce correct answers on training data but fail on novel inputs. Supervising intermediate steps encourages models to internalize causal, transferable reasoning patterns. This also yields interpretability benefits: a model whose intermediate steps are explicitly trained is more auditable than one that produces correct outputs through opaque internal computations. Empirically, Lightman et al. (2023, "Let's Verify Step by Step") found that a process-supervised reward model substantially outperformed an outcome-supervised one on the MATH benchmark, reducing reasoning errors in multi-step problem solving.
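The credit-assignment difference shows up directly in the shape of the reward signal. The toy comparison below (plain Python, with made-up annotations) contrasts a dense process-supervised signal, which localizes the flawed step, with a sparse outcome signal that rewards the whole trajectory because the final answer happens to be correct:

```python
# A 4-step solution whose final answer is correct despite a flawed step 3.
step_validity = [1, 1, 0, 1]                          # process annotations
process_rewards = [float(v) for v in step_validity]   # dense: [1, 1, 0, 1]
outcome_reward = [0., 0., 0., 1.]                     # sparse: terminal only

# The process signal pinpoints the error; the outcome signal would
# reinforce the entire (partly flawed) trajectory.
first_error = next(i for i, v in enumerate(step_validity) if v == 0)
print(f"first flawed step: {first_error + 1}")        # -> 3
```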
The main challenges are practical: obtaining high-quality process annotations is expensive and requires domain expertise, since annotators must evaluate not just whether an answer is right but whether each step is valid. There is also a risk of over-constraining models to follow particular solution paths, potentially limiting discovery of novel or more efficient approaches. Despite these trade-offs, process-based supervision has become a central technique in aligning LLMs for multi-step reasoning, with significant adoption in both academic research and production AI systems since roughly 2022.