Learnable weights that define how a reinforcement learning agent selects actions.
In reinforcement learning, a policy is a mapping from states to actions — the agent's decision-making strategy. Policy parameters are the tunable variables that define this mapping, whether they are the weights of a neural network in deep RL, the coefficients of a linear function approximator, or the entries of a lookup table in tabular settings. During training, these parameters are updated to increase the expected cumulative reward the agent receives, typically through gradient-based optimization or evolutionary methods.
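To make the idea concrete, here is a minimal sketch in Python with NumPy (sizes and names hypothetical) of a tabular softmax policy whose parameters are a single state-by-action matrix of action preferences; swapping this matrix for the weights of a neural network gives the deep RL case.

```python
import numpy as np

# Hypothetical sizes: 4 discrete states, 2 discrete actions.
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # the policy parameters

def action_probabilities(theta, state):
    """Map a state to a distribution over actions via softmax."""
    prefs = theta[state] - theta[state].max()  # shift for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def select_action(theta, state, rng):
    """Sample an action from the current policy."""
    return rng.choice(n_actions, p=action_probabilities(theta, state))

rng = np.random.default_rng(0)
print(action_probabilities(theta, state=0))  # uniform before any training
```

Training consists of nothing more than adjusting the entries of `theta` so that states come to favor higher-reward actions.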
Policy gradient methods, such as REINFORCE, directly optimize policy parameters by estimating the gradient of expected return with respect to those parameters and ascending it. More sophisticated algorithms like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) constrain how far the updated policy can move from the old one in a single step, via a KL-divergence trust region in TRPO and a clipped probability ratio in PPO, stabilizing training and preventing destructive policy collapse. In actor-critic architectures, the actor's parameters define the policy while a separate critic network estimates a value function to reduce variance in gradient estimates.
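The following sketch shows what a REINFORCE update looks like for the softmax policy above (function and variable names are illustrative, not from any particular library): each parameter moves in the direction that makes actions followed by high returns more probable. For a softmax policy, the gradient of the log-probability with respect to a state's preference vector is simply one-hot(action) minus the current probabilities.

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()  # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, trajectory, returns, learning_rate=0.1):
    """One REINFORCE update over a single episode.

    trajectory: list of (state, action) pairs visited in the episode.
    returns: the return G_t observed from each step onward.
    """
    for (state, action), g in zip(trajectory, returns):
        probs = softmax(theta[state])
        # Score function for a softmax policy: one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        # Gradient ascent on expected return, weighted by the return G_t.
        theta[state] += learning_rate * g * grad_log_pi
    return theta

# Hypothetical usage with the 4x2 parameter matrix from the earlier sketch.
theta = np.zeros((4, 2))
trajectory = [(0, 1), (2, 0)]   # (state, action) pairs
returns = [1.0, 0.5]            # returns observed after each step
theta = reinforce_update(theta, trajectory, returns)
```

An actor-critic method would replace the raw return `g` with an advantage estimate from the critic, reducing the variance of exactly this update.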
Policy parameters are central to nearly every modern RL system, from game-playing agents to robotic control and large language model fine-tuning via RLHF. The quality of learned parameters determines how well a policy generalizes across unseen states, how efficiently it explores, and how robustly it transfers to new environments. Advances in parameterization — including continuous action distributions, attention-based architectures, and hypernetworks — continue to expand what RL agents can learn and how quickly they converge to effective behavior.
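As one illustration of the continuous action distributions mentioned above, here is a minimal sketch (dimensions and names hypothetical) of a diagonal Gaussian policy, a common parameterization for continuous control: the parameters are the weights of a linear mean function plus a learned log-standard-deviation vector.

```python
import numpy as np

# Hypothetical dimensions: 3-dimensional states, 2-dimensional actions.
state_dim, action_dim = 3, 2
W = np.zeros((action_dim, state_dim))   # policy parameters: mean = W @ state
log_std = np.zeros(action_dim)          # policy parameters: learned log stddevs

def sample_action(state, rng):
    """Sample a continuous action from N(W @ state, exp(log_std)^2)."""
    mean = W @ state
    std = np.exp(log_std)
    return mean + std * rng.standard_normal(action_dim)

rng = np.random.default_rng(0)
print(sample_action(np.array([0.1, -0.2, 0.3]), rng))
```

Parameterizing the log of the standard deviation, rather than the standard deviation itself, keeps the distribution valid under unconstrained gradient updates, which is one reason this form is widely used.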