A reinforcement learning algorithm that ensures stable policy updates via constrained optimization.
Trust Region Policy Optimization (TRPO) is a policy gradient algorithm for reinforcement learning that addresses one of the field's most persistent challenges: how to update an agent's policy aggressively enough to learn quickly, yet cautiously enough to avoid catastrophic performance collapses. Traditional gradient-based policy updates offer no guarantee that a large step in parameter space won't produce a dramatically worse policy, making training brittle and sensitive to hyperparameter choices. TRPO solves this by framing each update as a constrained optimization problem, where the objective is to maximize expected reward subject to a hard limit on how much the policy is allowed to change.
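Concretely, writing $\pi_{\theta_{\text{old}}}$ for the current policy, $A^{\pi_{\theta_{\text{old}}}}$ for its advantage function, and $\delta$ for the trust region radius, each update solves the constrained problem (in the standard formulation from the paper):

\[
\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta.
\]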
The mechanism controlling policy change is the Kullback-Leibler (KL) divergence between the old and new policy distributions. Rather than simply capping the size of gradient steps, TRPO constrains the KL divergence directly; in practice the constraint is placed on the average KL divergence over states visited by the old policy, ensuring that the updated policy assigns similar action probabilities to its predecessor. Solving this constrained problem exactly is computationally expensive, so TRPO approximates it: the objective is linearized and the KL constraint is approximated quadratically, the resulting linear system is solved with the conjugate gradient method, and a backtracking line search confirms the step improves the surrogate objective while staying within the KL limit. The underlying theory provides a monotonic improvement guarantee: under its assumptions, each policy update is ensured not to degrade performance.
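A minimal NumPy sketch of that update step follows, assuming the caller supplies hypothetical callables `surrogate(theta)`, `kl(theta)`, and `fvp(v)` (a Fisher-vector product), along with the surrogate gradient `grad` at the current parameters; none of these names come from a specific library.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, given only Fisher-vector products fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F @ x (x starts at zero)
    p = r.copy()          # search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def trpo_step(theta, grad, fvp, surrogate, kl,
              max_kl=0.01, backtrack=0.8, max_backtracks=10):
    """One TRPO update: natural gradient direction via conjugate gradient,
    then a backtracking line search that enforces the KL trust region."""
    x = conjugate_gradient(fvp, grad)                  # x ~= F^{-1} g
    step_size = np.sqrt(2.0 * max_kl / (x @ fvp(x)))   # scale so the quadratic KL model equals max_kl
    full_step = step_size * x
    old_surr = surrogate(theta)
    for i in range(max_backtracks):                    # shrink the step until it is acceptable
        theta_new = theta + (backtrack ** i) * full_step
        if surrogate(theta_new) > old_surr and kl(theta_new) <= max_kl:
            return theta_new                           # improvement inside the trust region
    return theta                                       # no acceptable step found; keep old parameters
```

The conjugate gradient routine needs only Fisher-vector products, so the full Fisher matrix is never formed or inverted; that is what keeps the natural gradient step tractable for large policy networks.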
TRPO proved especially effective in high-dimensional continuous control tasks, such as robotic locomotion simulations, where naive policy gradient methods frequently diverge. Its principled treatment of update stability made it a landmark contribution when introduced by John Schulman and colleagues in 2015, and it established a conceptual framework that shaped subsequent work in the field. Most notably, Proximal Policy Optimization (PPO) emerged as a simpler, more computationally tractable successor that approximates TRPO's trust region using a clipped surrogate objective rather than an explicit KL constraint.
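For comparison, PPO's clipped surrogate replaces the constraint with a clipped probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$:

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\]

which discourages the ratio from straying far from 1 using only first-order optimization, with no conjugate gradient or line search required.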
While PPO has largely supplanted TRPO in practice due to its lower computational overhead and easier implementation, TRPO remains theoretically important as the rigorous foundation from which modern policy optimization methods descend. Understanding TRPO clarifies why stability constraints matter in reinforcement learning and provides intuition for the design choices embedded in the algorithms that followed it.