Reinforcement learning algorithms that optimize a policy directly via gradient ascent on the expected return.
Policy gradient methods are a family of reinforcement learning algorithms that optimize a policy's parameters directly, rather than deriving behavior indirectly from estimated value functions. The core idea is to parameterize the policy, often as a neural network, and compute the gradient of the expected cumulative reward with respect to those parameters. Ascending this gradient over repeated updates, the agent learns to take actions that yield higher long-term returns. This direct optimization makes policy gradient methods particularly well suited to continuous action spaces, where enumerating or maximizing over all possible actions is computationally intractable.
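In symbols (a standard score-function formulation, not tied to any single algorithm), with policy $\pi_\theta$, trajectory $\tau$, and trajectory return $R(\tau)$, the objective and its gradient are:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big].
$$

Because the expectation is over trajectories generated by the policy itself, the gradient can be estimated from sampled rollouts alone.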
The foundational algorithm in this family is REINFORCE, introduced by Ronald Williams in 1992, which estimates the policy gradient using sampled trajectories of experience. The key insight is that even though the reward signal is non-differentiable with respect to the actions taken, the log-probability of those actions under the policy is differentiable — enabling gradient-based optimization. In practice, raw REINFORCE suffers from high variance in gradient estimates, making learning slow and unstable. Subtracting a baseline — typically an estimate of the state's value — reduces this variance without introducing bias, a technique central to modern implementations.
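A minimal sketch of a REINFORCE update in PyTorch, assuming a discrete action space; `PolicyNet`, `reinforce_update`, and the trajectory format are illustrative names for this example, not part of any particular library:

```python
import torch
import torch.nn as nn

# Hypothetical policy network for a discrete action space:
# maps an observation vector to action logits.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update from a single sampled trajectory.

    trajectory: list of (obs, action, reward) tuples (illustrative format).
    """
    # Discounted return G_t at each timestep, computed backward in time.
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    # Simple baseline: the mean return over this trajectory.
    # Subtracting it reduces gradient variance without adding bias.
    baseline = returns.mean()

    loss = torch.tensor(0.0)
    for (obs, action, _), g in zip(trajectory, returns):
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        log_prob = dist.log_prob(torch.tensor(action))
        # Negative sign because optimizers minimize; we want gradient ascent.
        loss = loss - log_prob * (g - baseline)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the reward itself never needs to be differentiable: gradients flow only through the log-probabilities of the sampled actions.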
Actor-Critic methods extend this framework by maintaining two components: an actor that represents the policy and a critic that estimates a value function. The critic's estimates serve as low-variance baselines or advantage signals, guiding the actor's updates more efficiently than Monte Carlo returns alone. This architecture underlies many state-of-the-art algorithms, including Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), which add further stabilization through clipped objectives or entropy regularization.
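A minimal one-step actor-critic sketch in the same PyTorch style, assuming a discrete action space and a critic that maps an observation to a scalar value estimate; all function and argument names here are illustrative:

```python
import torch

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        obs, action, reward, next_obs, done, gamma=0.99):
    """One-step (TD) actor-critic update; all names here are illustrative.

    actor:  network mapping observation -> action logits.
    critic: network mapping observation -> scalar state-value estimate.
    """
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    next_t = torch.as_tensor(next_obs, dtype=torch.float32)

    # Critic's value estimate for the current state.
    value = critic(obs_t).squeeze(-1)

    # TD target and advantage; no gradients flow through the target.
    with torch.no_grad():
        target = reward + gamma * critic(next_t).squeeze(-1) * (1.0 - float(done))
        advantage = target - value.detach()

    # Critic step: regress V(s) toward the TD target.
    critic_loss = (target - value).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: policy gradient weighted by the low-variance advantage.
    dist = torch.distributions.Categorical(logits=actor(obs_t))
    actor_loss = -dist.log_prob(torch.tensor(action)) * advantage
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

PPO keeps this two-network structure but clips the ratio of new to old action probabilities so each update stays close to the data-collecting policy; SAC instead adds an entropy term to the actor's objective to sustain exploration.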
Policy gradient methods matter because they are among the most general and flexible tools in deep reinforcement learning. They handle stochastic policies naturally, support exploration through entropy bonuses, and scale to high-dimensional action spaces. Their ability to optimize non-differentiable reward signals end-to-end has made them central to breakthroughs in robotics, game playing, and large language model alignment through techniques like reinforcement learning from human feedback (RLHF).