A reinforcement learning approach that directly optimizes a policy to maximize cumulative reward.
Policy learning is a core paradigm in reinforcement learning focused on directly optimizing a policy — a function that maps observed states to actions — so as to maximize expected cumulative reward over time. Rather than first learning a value function and deriving behavior from it, policy learning methods treat the policy itself as the primary object of optimization, adjusting its parameters directly through experience. This distinction makes policy learning especially powerful in settings with continuous or high-dimensional action spaces, where value-based approaches like Q-learning, which must maximize over the action space at every step, become computationally impractical.
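As a concrete illustration of a parameterized policy, the sketch below (a minimal assumption-laden example, not any specific library's API) maps a state feature vector to a probability distribution over discrete actions via a softmax over linear scores; the parameter matrix `theta` is the object that policy learning adjusts:

```python
import numpy as np

def softmax_policy(theta, state):
    """Map a state to a probability distribution over actions.

    theta: (n_features, n_actions) parameter matrix (illustrative setup;
    a deep policy would replace this linear map with a neural network).
    state: (n_features,) feature vector.
    """
    logits = state @ theta
    logits = logits - logits.max()   # subtract max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Example: 3 state features, 2 actions
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))
state = np.array([1.0, 0.5, -0.2])
probs = softmax_policy(theta, state)
# acting means sampling: rng.choice(2, p=probs)
```

Because the output is a differentiable function of `theta`, gradients of expected reward with respect to `theta` can be estimated and followed directly, which is the core move of the methods described next.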
The dominant family of techniques in policy learning is policy gradient methods, which estimate the gradient of expected return with respect to policy parameters and apply gradient ascent to improve performance. The foundational REINFORCE algorithm formalized this approach by using Monte Carlo rollouts to estimate policy gradients. Actor-critic architectures extend this idea by pairing a policy network (the actor) with a learned value function (the critic), reducing the variance of gradient estimates and improving sample efficiency. More modern algorithms such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) introduced principled constraints on policy updates to prevent destabilizing parameter changes, dramatically improving training stability and reliability.
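The REINFORCE update can be sketched end to end on the simplest possible problem, a two-armed bandit, where each pull is a one-step episode and the return is just the sampled reward. This is a hedged, self-contained illustration (the problem setup, learning rate, and running-mean baseline are choices made here, not prescribed by the algorithm); the baseline plays the variance-reduction role that a learned critic plays in actor-critic methods:

```python
import numpy as np

def reinforce_bandit(true_means, steps=2000, lr=0.1, seed=0):
    """Monte Carlo policy gradient (REINFORCE) on a two-armed bandit.

    A softmax policy over arm preferences is updated by gradient ascent
    on expected reward, using the score-function estimator
    (reward - baseline) * grad log pi(a).
    """
    rng = np.random.default_rng(seed)
    prefs = np.zeros(len(true_means))   # policy parameters (logits)
    baseline = 0.0                      # running mean reward
    for t in range(1, steps + 1):
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        a = rng.choice(len(prefs), p=probs)
        reward = true_means[a] + rng.normal(scale=0.1)
        baseline += (reward - baseline) / t
        # For a softmax policy: grad log pi(a) = one_hot(a) - probs
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        prefs += lr * (reward - baseline) * grad_log_pi
    return probs

probs = reinforce_bandit(true_means=[0.2, 0.8])
# the learned policy concentrates on the higher-reward second arm
```

TRPO and PPO keep this same gradient estimator but additionally constrain how far each update may move the policy distribution, which is what stabilizes training on harder problems.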
Policy learning naturally accommodates stochastic policies, which are essential in partially observable environments or multi-agent settings where deterministic behavior can be exploited. It also integrates naturally with function approximators like deep neural networks, giving rise to deep policy gradient methods that have achieved remarkable results across domains including robotic locomotion, game playing, and natural language generation via reinforcement learning from human feedback (RLHF).
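For continuous action spaces, a common stochastic-policy choice is a diagonal Gaussian whose mean depends on the state. The sketch below assumes a linear mean function and a state-independent `log_std` parameter (both illustrative stand-ins, though the latter is a common choice in deep policy gradient implementations); the log-probability it returns is what a policy gradient update would weight by the advantage:

```python
import numpy as np

def gaussian_policy_sample(mean_params, log_std, state, rng):
    """Sample a continuous action from a diagonal-Gaussian policy.

    mean_params: (n_features, action_dim) linear map standing in for a
    neural network; log_std: (action_dim,) learned log standard deviation.
    Returns the sampled action and its log-probability under the policy.
    """
    mean = state @ mean_params
    std = np.exp(log_std)
    action = mean + std * rng.normal(size=mean.shape)
    # log N(action; mean, std^2), summed over independent action dims
    log_prob = -0.5 * (((action - mean) / std) ** 2
                       + 2 * log_std + np.log(2 * np.pi)).sum()
    return action, log_prob

rng = np.random.default_rng(1)
state = np.array([0.5, -1.0])
mean_params = np.array([[0.3], [0.1]])   # 2 features -> 1-D action
log_std = np.array([-0.5])
action, log_prob = gaussian_policy_sample(mean_params, log_std, state, rng)
```

Because sampling and the log-probability are both differentiable in `mean_params` and `log_std`, the same gradient machinery used for discrete softmax policies applies unchanged to continuous control.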
The practical impact of policy learning has been substantial. It underpins breakthroughs such as AlphaGo's game-playing ability, state-of-the-art robotic control systems, and the fine-tuning of large language models to align with human preferences. Its flexibility — handling continuous actions, stochastic behavior, and end-to-end differentiable pipelines — makes it one of the most versatile and actively researched areas in modern machine learning.