Using AI-generated reward models to recursively improve the reward signal used to train successively better AI models.
Recursive Reward Modeling (RRM) trains AI models by iteratively improving the reward signal: each generation produces a better reward model than the last, forming a self-improving loop that requires regular human audits to stay aligned with the intended goals.
The process starts with a human-written reward model that captures basic preferences; a model is then trained to imitate it. The learned model can outperform the hand-written one because it is fit to many labeled examples, smoothing over the seed model's inconsistencies and generalizing to cases the original rules did not anticipate. A second iteration trains a new reward model on outputs preferred by the previous one, and the cycle continues, with human evaluators periodically auditing the process. Each iteration yields a better reward signal than its predecessor without requiring additional human labeling.
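To make the loop concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical rather than any real API: the hand-written seed is modeled as a noisy approximation of the true preferences, reward models are linear scorers over toy feature vectors, and the helper names (seed_reward, train_on_preferences, audit_agreement) are illustrative placeholders.

```python
import random

# Toy RRM loop. All names and the setup are illustrative placeholders.
# "Outputs" are 3-dimensional feature vectors; the hidden true reward stands
# in for actual human preferences and is consulted only during audits.

def true_reward(x):
    return 2.0 * x[0] + 1.0 * x[1] - 0.5 * x[2]

def seed_reward(x):
    # Human-written seed model: tracks the truth only roughly. The noise
    # models the inconsistency of a crude hand-coded preference rule.
    return true_reward(x) + random.gauss(0.0, 1.0)

def sample_outputs(n):
    return [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]

def train_on_preferences(teacher, samples, lr=0.05, steps=4000):
    # Fit a linear reward model to pairwise preferences labeled by `teacher`,
    # with a perceptron-style update on misranked pairs (a stand-in for real
    # reward-model training). No human labels are consumed here.
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        a, b = random.choice(samples), random.choice(samples)
        preferred, other = (a, b) if teacher(a) > teacher(b) else (b, a)
        margin = sum(wi * (pi - oi) for wi, pi, oi in zip(w, preferred, other))
        if margin <= 0.0:  # student misranks this pair: nudge the weights
            for i in range(3):
                w[i] += lr * (preferred[i] - other[i])
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def audit_agreement(model, samples, n_pairs=1000):
    # Lightweight human audit: fraction of random pairs the model ranks the
    # same way the true (human) preference would.
    pairs = [(random.choice(samples), random.choice(samples))
             for _ in range(n_pairs)]
    agree = sum((model(a) > model(b)) == (true_reward(a) > true_reward(b))
                for a, b in pairs)
    return agree / n_pairs

if __name__ == "__main__":
    random.seed(0)
    reward_model = seed_reward
    for iteration in range(1, 4):
        samples = sample_outputs(2000)
        # Each generation is trained purely on the previous generation's
        # preferences; human effort is limited to the periodic audit.
        reward_model = train_on_preferences(reward_model, samples)
        print(f"iteration {iteration}: "
              f"audit agreement = {audit_agreement(reward_model, samples):.2f}")
```

Because the first student is fit to thousands of noisy comparisons, it can rank pairs more consistently than any single noisy seed judgment, which is the toy analogue of the generalization claim above.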
RRM can bootstrap capability improvements without constant human labeling, but each iteration risks reward hacking, where the model learns to exploit flaws in the reward signal rather than genuinely improve. The approach works best when human oversight is lightweight but consistent, though each iteration of the loop adds computational cost and latency. Smaller models, for which collecting direct human feedback would be prohibitively expensive, can still benefit from participating in the RRM chain.
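Continuing the same toy sketch (and reusing its hypothetical helpers), one way to implement lightweight-but-consistent oversight is to gate each iteration on the audit score and halt the chain for human review when agreement with human judgments degrades. The threshold below is an arbitrary illustrative choice, and a falling audit score is only a crude proxy for reward hacking, not a reliable detector.

```python
AUDIT_THRESHOLD = 0.90  # illustrative tolerance set by the human overseers
MAX_ITERATIONS = 10

def run_rrm_with_audits(seed_model):
    # Promote a new reward model only while audits keep passing. A drop in
    # agreement with human judgments is treated as a crude proxy for reward
    # hacking and stops the chain for human review.
    reward_model = seed_model
    for iteration in range(1, MAX_ITERATIONS + 1):
        samples = sample_outputs(2000)
        candidate = train_on_preferences(reward_model, samples)
        score = audit_agreement(candidate, sample_outputs(500))
        if score < AUDIT_THRESHOLD:
            print(f"iteration {iteration}: audit failed ({score:.2f}); "
                  "halting for human review")
            break
        reward_model = candidate  # audit passed: promote this generation
    return reward_model
```

Auditing every iteration, rather than sampling occasionally, keeps the per-audit human cost small while ensuring no generation is promoted unchecked.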
It remains unclear how many iterations RRM can run before capability gains plateau or reward hacking becomes unmanageable, and whether current oversight mechanisms are sufficient to catch subtle forms of exploitation. Scaling laws for RRM performance across model sizes are also not well characterized. Whether the approach can reliably produce genuinely novel capabilities, rather than merely amplify existing ones, is an open research question.