A reinforcement learning algorithm that fine-tunes language models by comparing rewards across groups of sampled outputs.
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to fine-tune large language models using reward signals, without requiring a separate critic or value network. Introduced by DeepSeek researchers in 2024, GRPO generates a group of candidate outputs for each input prompt, scores them using a reward model, and computes advantages by comparing each output's reward against the group mean. This relative scoring approach eliminates the need for a learned baseline or value function, substantially reducing memory and compute overhead compared to methods like PPO that maintain a separate critic network of comparable size.
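As a minimal sketch of the group-relative baseline idea, the snippet below normalizes a group of rewards for one prompt; the function name, group size, and epsilon are illustrative assumptions rather than values taken from the GRPO paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative sketch: `rewards` has shape (group_size,) for a single prompt.

    Each output's advantage is its reward minus the group mean, scaled by the
    group standard deviation, so no learned value network is required.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Hypothetical usage: 4 sampled completions for one prompt, scored by a reward model.
rewards = torch.tensor([0.1, 0.9, 0.4, 0.6])
advantages = group_relative_advantages(rewards)  # zero-mean, roughly unit-scale signal
```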
The core update rule in GRPO resembles PPO's clipped surrogate objective but replaces per-token value estimates with group-normalized advantage signals. For each sampled output, the advantage is computed as the reward minus the group mean, normalized by the group standard deviation. A KL divergence penalty against a reference policy is added to prevent the model from drifting too far from its supervised fine-tuning initialization. This combination of group-relative baselines and KL regularization produces stable, sample-efficient updates that scale well to large language models.
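A rough sketch of how these pieces fit together is given below. It assumes per-token log-probabilities under the current, sampling-time, and frozen reference policies are already available; the function name, tensor shapes, clip range, and KL coefficient are illustrative assumptions, not the exact hyperparameters used in published GRPO training runs.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    """Sketch of a GRPO-style objective for one group of sampled outputs.

    logp_new, logp_old, logp_ref: (group_size, seq_len) per-token log-probs
    under the current, old (sampling-time), and reference policies.
    advantages: (group_size,) group-normalized rewards, shared by every token
    of the corresponding output.
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio per token
    adv = advantages.unsqueeze(-1)                          # broadcast advantage to all tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_term = torch.min(unclipped, clipped)             # PPO-style clipped surrogate

    # Unbiased estimator of KL(policy || reference), penalizing drift from the
    # supervised fine-tuning initialization.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Negate because optimizers minimize: maximize surrogate, penalize KL.
    return -(policy_term - kl_coef * kl).mean()
```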
GRPO gained significant attention through its use in training DeepSeek-R1 and related reasoning-focused models, where it enabled models to develop extended chain-of-thought reasoning behaviors through reinforcement learning on verifiable tasks such as mathematics and code generation. Because correctness on such tasks can be checked automatically, GRPO can be applied without human preference labels, making it attractive for scalable self-improvement pipelines. The algorithm demonstrated that competitive reasoning performance could be achieved with substantially less infrastructure than critic-based RL methods.
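To illustrate what a verifiable reward looks like in this setting, here is a toy checker that scores a completion against a known reference answer; the function name and matching heuristic are hypothetical, and real pipelines use much more robust verifiers (answer normalization, symbolic comparison, or unit tests for code).

```python
import re

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the last number in the completion matches
    the reference answer, else 0.0. No human preference labels are needed."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

# Hypothetical usage: score a sampled completion automatically.
print(verifiable_math_reward("... so the answer is 42", "42"))  # 1.0
```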
More broadly, GRPO represents a trend toward lightweight, critic-free policy optimization methods suited to the specific demands of language model fine-tuning. Its design choices—group sampling, relative advantage estimation, and reference-policy regularization—address the high variance and instability that plague naive policy gradient approaches when applied to autoregressive generation. As reinforcement learning from verifiable rewards becomes a standard component of LLM training pipelines, GRPO has emerged as a practical and widely adopted baseline algorithm.