Envisioning is an emerging technology research institute and advisory.

Policy Gradient

Reinforcement learning algorithms that optimize a policy directly via gradient ascent on expected rewards.

Year: 1992
Generality: 796

Policy gradient methods are a family of reinforcement learning algorithms that optimize a policy's parameters directly, rather than deriving behavior indirectly from estimated value functions. The core idea is to parameterize the policy — often as a neural network — and compute the gradient of expected cumulative reward with respect to those parameters. By following this gradient upward through repeated updates, the agent learns to take actions that yield higher long-term returns. This direct optimization makes policy gradient methods particularly well-suited to continuous action spaces, where enumerating or maximizing over all possible actions is computationally intractable.
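
The core update can be sketched on a toy problem. Below is a minimal, self-contained example of direct policy optimization on a hypothetical three-armed bandit (the reward means, learning rate, and step count are illustrative, not from any particular system): the policy is a softmax over logits, and each update follows the sampled gradient of expected reward upward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: each arm pays a noisy reward around
# these means; the policy should learn to prefer arm 2.
true_means = np.array([0.2, 0.5, 1.0])

theta = np.zeros(3)   # policy parameters (logits)
alpha = 0.1           # learning rate (illustrative)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # sample an action from the policy
    r = true_means[a] + rng.normal(0, 0.1)   # observe a noisy reward
    # Score-function (log-derivative) trick: for a softmax policy,
    # grad of log pi(a) w.r.t. the logits is onehot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi         # gradient *ascent* on expected reward

print(softmax(theta))  # probability mass concentrates on the best arm
```

The reward itself is never differentiated; only the log-probability of the sampled action is, which is exactly what makes this work for non-differentiable reward signals.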

The foundational algorithm in this family is REINFORCE, introduced by Ronald Williams in 1992, which estimates the policy gradient using sampled trajectories of experience. The key insight is that even though the reward signal is non-differentiable with respect to the actions taken, the log-probability of those actions under the policy is differentiable — enabling gradient-based optimization. In practice, raw REINFORCE suffers from high variance in gradient estimates, making learning slow and unstable. Subtracting a baseline — typically an estimate of the state's value — reduces this variance without introducing bias, a technique central to modern implementations.
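
The baseline claim — lower variance, no added bias — can be checked empirically. The sketch below (policy probabilities and reward values are made up for illustration) draws many single-sample REINFORCE gradient estimates with and without a value baseline and compares their means and variances.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed softmax policy over 3 actions with deterministic rewards
# (illustrative values, not from the article).
probs = np.array([0.2, 0.3, 0.5])
rewards = np.array([5.0, 6.0, 7.0])
baseline = probs @ rewards  # value estimate V = E[r] under the policy

def grad_sample(b):
    """One REINFORCE gradient sample, with baseline b subtracted."""
    a = rng.choice(3, p=probs)
    g = -probs.copy()
    g[a] += 1.0                  # grad log pi(a) for a softmax policy
    return (rewards[a] - b) * g

raw  = np.array([grad_sample(0.0)      for _ in range(20000)])
base = np.array([grad_sample(baseline) for _ in range(20000)])

# Same expected gradient (the baseline introduces no bias)...
print(raw.mean(axis=0).round(2), base.mean(axis=0).round(2))
# ...but much lower total variance across components.
print(raw.var(axis=0).sum(), base.var(axis=0).sum())
```

The unbiasedness follows because the baseline multiplies the expectation of the score function, which is zero for any valid probability distribution.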

Actor-Critic methods extend this framework by maintaining two components: an actor that represents the policy and a critic that estimates a value function. The critic's estimates serve as low-variance baselines or advantage signals, guiding the actor's updates more efficiently than Monte Carlo returns alone. This architecture underlies many state-of-the-art algorithms, including Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), which add further stabilization through clipped objectives or entropy regularization.
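
A minimal sketch of the actor-critic split, on an invented two-state contextual bandit (reward table, learning rates, and step counts are assumptions for illustration): the critic maintains per-state value estimates, and the actor is updated with the advantage, the observed reward minus the critic's estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical contextual bandit: in state 0 action 0 pays off,
# in state 1 action 1 does.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

theta = np.zeros((2, 2))   # actor: per-state logits
V = np.zeros(2)            # critic: per-state value estimates
alpha_actor, alpha_critic = 0.2, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(3000):
    s = rng.integers(2)
    probs = softmax(theta[s])
    a = rng.choice(2, p=probs)
    r = R[s, a] + rng.normal(0, 0.1)
    advantage = r - V[s]                      # critic supplies the baseline
    V[s] += alpha_critic * advantage          # critic update (one-step episodes)
    g = -probs
    g[a] += 1.0                               # grad log pi(a|s)
    theta[s] += alpha_actor * advantage * g   # actor update

print(softmax(theta[0]), softmax(theta[1]))
```

Production algorithms like PPO and SAC build on this same actor/critic split, adding the clipped objectives or entropy terms mentioned above.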

Policy gradient methods matter because they are among the most general and flexible tools in deep reinforcement learning. They handle stochastic policies naturally, support exploration through entropy bonuses, and scale to high-dimensional action spaces. Their ability to optimize non-differentiable reward signals end-to-end has made them central to breakthroughs in robotics, game playing, and large language model alignment through techniques like reinforcement learning from human feedback (RLHF).

Related

Policy Gradient Algorithm

Reinforcement learning method that directly optimizes a policy by following reward gradients.

Generality: 728
Policy Learning

Reinforcement learning approach that directly optimizes a policy to maximize cumulative reward.

Generality: 794
Policy Parameters

Learnable weights that define how a reinforcement learning agent selects actions.

Generality: 581
PPO (Proximal Policy Optimization)

A stable, efficient reinforcement learning algorithm using clipped policy updates.

Generality: 694
GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm that trains models using group-sampled reward comparisons.

Generality: 292
TRPO (Trust Region Policy Optimization)

A reinforcement learning algorithm that ensures stable policy updates via constrained optimization.

Generality: 620