
Envisioning is an emerging technology research institute and advisory.




GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm that trains models using group-sampled reward comparisons.

Year: 2024 · Generality: 292

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to fine-tune large language models using reward signals, without requiring a separate critic or value network. Introduced by DeepSeek researchers in 2024, GRPO generates a group of candidate outputs for each input prompt, scores them using a reward model, and computes advantages by comparing each output's reward against the group mean. This relative scoring approach eliminates the need for a learned baseline or value function, substantially reducing memory and compute overhead compared to methods like PPO that maintain a separate critic network of comparable size.
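The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation; the function name and the example reward values are assumptions for demonstration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of sampled outputs.

    Each output's advantage is its reward minus the group mean,
    normalized by the group standard deviation. No learned value
    network or critic is involved.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # eps guards against division by zero when all rewards are equal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one prompt, scored by a reward model.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the empirical group mean, the advantages always sum to (approximately) zero within a group: above-average outputs are reinforced and below-average ones suppressed.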

The core update rule in GRPO resembles PPO's clipped surrogate objective but replaces per-token value estimates with group-normalized advantage signals. For each sampled output, the advantage is computed as the reward minus the group mean, normalized by the group standard deviation. A KL divergence penalty against a reference policy is added to prevent the model from drifting too far from its supervised fine-tuning initialization. This combination of group-relative baselines and KL regularization produces stable, sample-efficient updates that scale well to large language models.
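A simplified sketch of that objective, reduced to per-sequence log-probabilities rather than per-token terms; the `clip_eps` and `kl_coef` values are illustrative defaults, not published hyperparameters.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_coef=0.04):
    """Simplified GRPO objective for one group of sampled outputs.

    logp_*: log-probabilities of each output under the current,
    behavior (old), and frozen reference policies.
    advantages: group-normalized rewards. Returns a scalar loss
    to minimize (negative of the surrogate objective).
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style clipped surrogate, but with group-relative advantages
    # in place of critic-derived value estimates.
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    # Unbiased KL estimator against the reference policy:
    # r - log(r) - 1, with r = pi_ref / pi_new. Always nonnegative.
    r = np.exp(logp_ref - logp_new)
    kl = r - (logp_ref - logp_new) - 1.0
    return -(surrogate - kl_coef * kl).mean()

# With identical policies and zero-mean advantages, the loss is zero:
# ratio = 1, the KL estimate vanishes, and the surrogate averages out.
adv = np.array([1.0, -1.0, 0.0, 0.0])
lp = np.full(4, -2.0)
loss = grpo_loss(lp, lp.copy(), lp.copy(), adv)
```

The KL term is computed directly from log-probabilities under the current and reference policies, so no extra rollouts are needed to regularize the update.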

GRPO gained significant attention through its use in training DeepSeek-R1 and related reasoning-focused models, where it enabled models to develop extended chain-of-thought reasoning behaviors through reinforcement learning on verifiable tasks such as mathematics and code generation. Because correctness on such tasks can be checked automatically, GRPO can be applied without human preference labels, making it attractive for scalable self-improvement pipelines. The algorithm demonstrated that competitive reasoning performance could be achieved with substantially less infrastructure than critic-based RL methods.
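A verifiable reward of the kind described above can be as simple as an exact-match check. The answer-delimiter convention below (`####`) is an assumption borrowed from common math-benchmark formatting, not a requirement of GRPO.

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the model's final answer matches
    the known solution, else 0.0. Correctness is checked
    programmatically, so no human preference labels are needed.
    """
    # Assume the model emits its final answer after '####'
    # (a GSM8K-style convention); this parsing is illustrative.
    final = model_answer.split("####")[-1].strip()
    return 1.0 if final == ground_truth.strip() else 0.0

rewards = [math_reward(ans, "42") for ans in
           ["Let me reason step by step... #### 42",
            "Let me reason step by step... #### 41"]]
```

Scoring a whole group of sampled outputs this way yields exactly the reward vector that the group-relative advantage computation consumes.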

More broadly, GRPO represents a trend toward lightweight, critic-free policy optimization methods suited to the specific demands of language model fine-tuning. Its design choices—group sampling, relative advantage estimation, and reference-policy regularization—address the high variance and instability that plague naive policy gradient approaches when applied to autoregressive generation. As reinforcement learning from verifiable rewards becomes a standard component of LLM training pipelines, GRPO has emerged as a practical and widely adopted baseline algorithm.

Related

TRPO (Trust Region Policy Optimization)

A reinforcement learning algorithm that ensures stable policy updates via constrained optimization.

Generality: 620
PPO (Proximal Policy Optimization)

A stable, efficient reinforcement learning algorithm using clipped policy updates.

Generality: 694
Policy Gradient Algorithm

Reinforcement learning method that directly optimizes a policy by following reward gradients.

Generality: 728
DPO (Direct Preference Optimization)

A training method that fine-tunes language models directly from human preference data.

Generality: 494
Policy Gradient

Reinforcement learning algorithms that optimize a policy directly via gradient ascent on expected rewards.

Generality: 796
MDPO (Mirror Descent Policy Optimization)

A reinforcement learning algorithm using mirror descent for stable, geometry-aware policy updates.

Generality: 292