Reinforcement learning method that directly optimizes a policy by following the gradient of expected return with respect to the policy's parameters.
Policy gradient algorithms are a class of reinforcement learning methods that optimize a policy—a mapping from states to actions—directly, rather than learning it indirectly through value function estimation. The core idea is to parameterize the policy (typically with a neural network) and compute the gradient of expected cumulative reward with respect to those parameters. By repeatedly sampling trajectories from the environment and using the resulting rewards to estimate this gradient, the algorithm nudges the policy parameters in a direction that makes higher-reward behaviors more probable. This direct optimization stands in contrast to value-based approaches like Q-learning, which derive a policy implicitly from learned value estimates.
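Concretely, the quantity being ascended can be written in conventional notation as follows (a sketch; the symbols θ for policy parameters, γ for the discount factor, and G_t for the return-to-go are standard conventions, not drawn from this text):

```latex
% Objective: expected discounted return over trajectories \tau sampled from \pi_\theta
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]

% Score-function (likelihood-ratio) form of the policy gradient,
% with G_t the discounted return-to-go from step t:
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k
```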
The mathematical foundation rests on the policy gradient theorem, which provides a tractable expression for the gradient of expected return even when the environment's dynamics are unknown. The simplest instantiation, the REINFORCE algorithm, uses Monte Carlo rollouts to estimate this gradient: actions that led to above-average returns are reinforced, while those that led to below-average returns are suppressed. In practice, raw REINFORCE suffers from high variance, so modern variants introduce baselines, advantage functions, or critic networks to stabilize learning. Actor-critic architectures, for instance, pair a policy network (the actor) with a value-estimating network (the critic) to reduce gradient variance while preserving the benefits of direct policy optimization.
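A minimal REINFORCE-style update with a mean-return baseline might look like the sketch below. This is an illustrative assumption of how the idea is typically coded (PyTorch and Gymnasium are assumed; the CartPole environment, network size, and hyperparameters are arbitrary choices, not part of the original text):

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Policy network: maps a state to logits of a categorical action distribution.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        # Sample an action from the current stochastic policy.
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns-to-go G_t, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Subtracting the mean return is a simple baseline that reduces variance.
    advantages = returns - returns.mean()

    # REINFORCE loss: minimizing -sum(log pi(a_t|s_t) * advantage_t)
    # ascends the estimated policy gradient.
    loss = -(torch.stack(log_probs) * advantages).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

An actor-critic variant would replace the mean-return baseline with a learned value network, trading extra machinery for lower-variance gradient estimates.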
Policy gradient methods are particularly well-suited to problems with continuous or high-dimensional action spaces, where value-based methods struggle to represent or maximize over all possible actions. They also naturally support stochastic policies, which are essential in partially observable environments or multi-agent settings where randomization provides strategic value. Algorithms such as Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO) address a key instability in naive policy gradient updates—large parameter steps can catastrophically degrade performance—by constraining how much the policy is allowed to change per update.
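The constraint at the heart of PPO is a clipped surrogate objective. A minimal sketch of that loss term is shown below (hypothetical PyTorch code; the function name, tensor shapes, and the clip range of 0.2 are illustrative assumptions):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss from PPO, negated so it can be minimized.

    new_log_probs : log pi_theta(a_t | s_t) under the current policy
    old_log_probs : log probabilities recorded when the data was collected
    advantages    : advantage estimates A_t for each sampled action
    """
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping the ratio removes the incentive to move the policy more than
    # a factor of (1 +/- clip_eps) away from the policy that gathered the data.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```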
Today, policy gradient methods underpin many of the most capable reinforcement learning systems, from game-playing agents to robotic control and the fine-tuning of large language models via reinforcement learning from human feedback (RLHF). Their flexibility, theoretical grounding, and compatibility with deep neural networks have made them a cornerstone of modern RL practice.