A reinforcement learning algorithm using mirror descent for stable, geometry-aware policy updates.
Mirror Descent Policy Optimization (MDPO) is a reinforcement learning algorithm that applies the mirror descent optimization framework to the problem of policy improvement. Traditional policy gradient methods update parameters using standard Euclidean gradient steps, which can be poorly suited to the geometry of probability distributions over actions. MDPO instead performs updates in a dual space defined by a convex mirror map, allowing policy changes to respect the intrinsic structure of the policy space. This geometric awareness yields more principled updates that naturally respect constraints inherent to the policy space, such as the requirement that action probabilities remain a valid distribution.
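To make the "update in a dual space" idea concrete, the following minimal sketch (not from the source; function and variable names are illustrative) performs a single mirror descent step on a probability distribution over actions. With the negative-entropy mirror map, the dual variables are log-probabilities, and the step reduces to the familiar exponentiated-gradient (multiplicative-weights) update, which keeps the iterate on the simplex without any explicit projection in the Euclidean sense.

```python
import numpy as np

def mirror_descent_step(p, grad, step_size):
    """One mirror descent step on the probability simplex.

    Assumes the negative-entropy mirror map: map p to log-probabilities
    (the dual space), take a gradient step there, then map back to the
    simplex by exponentiating and renormalizing.
    """
    logits = np.log(p)                  # map to the dual space
    logits = logits - step_size * grad  # gradient step in the dual space
    p_new = np.exp(logits - logits.max())  # max-subtraction for numerical stability
    return p_new / p_new.sum()          # map back to the primal simplex

# Example: nudge a uniform action distribution against a loss gradient.
p = np.ones(4) / 4
g = np.array([0.5, -0.2, 0.1, 0.0])    # hypothetical loss gradient
print(mirror_descent_step(p, g, step_size=0.1))
```

Note that the resulting update multiplies each probability by exp(-step_size * grad_i), so probabilities can shrink toward zero but never become negative, which is exactly the kind of structure a Euclidean gradient step does not guarantee.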
The algorithm alternates between a policy evaluation phase, where the current policy's value (or advantage) function is estimated, and a mirror descent update phase, where the policy is improved by solving a proximal optimization problem in the dual space and mapping the result back to the primal space. The choice of mirror map is central to the method's behavior: using the negative entropy as the mirror map, for instance, makes the associated Bregman divergence the KL divergence and recovers update rules closely related to trust-region and proximal policy optimization methods. This connection reveals that several prominent RL algorithms can be understood as special cases of the broader mirror descent framework.
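One common way to write this proximal step, assuming the negative-entropy mirror map so that the Bregman divergence becomes a KL divergence, is roughly:

$$
\pi_{k+1} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{s \sim \rho_{\pi_k}}\!\Big[\,
  \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s, a) \big]
  \;-\; \tfrac{1}{t_k}\, D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big)
\Big]
$$

Here $A^{\pi_k}$ is the advantage of the current policy (the output of the evaluation phase), $\rho_{\pi_k}$ is the state distribution it induces, and $t_k$ is a step size; shrinking $t_k$ tightens the proximity to the previous policy, which is where the connection to trust-region and proximal methods appears.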
MDPO matters because it provides a unifying theoretical lens for understanding policy optimization algorithms and offers practical benefits in terms of convergence stability and sample efficiency. By controlling the size of policy updates in a geometry-aware way, MDPO reduces the risk of destructively large policy changes that can destabilize training—a persistent challenge in deep reinforcement learning. The framework also facilitates rigorous convergence analysis, since mirror descent has well-understood theoretical guarantees in convex optimization that can be partially extended to the RL setting.
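In deep RL implementations, this geometry-aware control of update size typically shows up as a KL penalty added to the usual importance-weighted surrogate objective. The sketch below is a hypothetical, simplified loss (names and exact form are illustrative, not the source's implementation), loosely following how such KL-regularized surrogates are commonly coded:

```python
import torch

def mdpo_style_loss(new_logp, old_logp, advantages, kl_new_old, step_size):
    """Hypothetical KL-regularized surrogate objective (to be minimized).

    Maximizes the importance-weighted advantage while penalizing the KL
    divergence from the previous policy; 1 / step_size controls how far a
    single policy update is allowed to move.
    """
    ratio = torch.exp(new_logp - old_logp)     # pi_new(a|s) / pi_old(a|s)
    improvement = (ratio * advantages).mean()  # policy-improvement term
    proximity = kl_new_old.mean()              # per-state KL(pi_new || pi_old)
    return -(improvement - proximity / step_size)
```

The penalty plays the role of the Bregman proximity term in the update above: a large deviation from the previous policy is discouraged regardless of how attractive the improvement term looks on the current batch.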
The application of mirror descent to policy optimization gained significant attention around 2019–2020, driven by work from researchers including Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Their contributions helped formalize MDPO and clarify its relationships to existing methods like PPO and TRPO, strengthening both the theoretical foundations and practical toolkit available to reinforcement learning practitioners.