Reinforcement learning architecture combining a policy-selecting actor with an evaluating critic.
Actor-critic models are a class of reinforcement learning architectures that combine two complementary components: an actor, which selects actions according to a learned policy, and a critic, which evaluates those actions by estimating a value function. This dual structure bridges the gap between purely policy-based methods, which can suffer from high variance, and purely value-based methods, which struggle with continuous or high-dimensional action spaces. By integrating both approaches, actor-critic models achieve more stable and sample-efficient learning than either method alone.
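The two components can be made concrete with a minimal sketch. Here the actor is a table of action preferences turned into a probability distribution by softmax, and the critic is a table of state-value estimates; the table sizes and names are illustrative, not part of any particular algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2  # toy sizes for illustration

# Actor: action preferences per state, turned into a policy via softmax.
actor_logits = rng.normal(size=(N_STATES, N_ACTIONS))

# Critic: estimated state values V(s).
critic_values = np.zeros(N_STATES)

def policy(state):
    """Actor's action distribution pi(.|s) for a given state."""
    logits = actor_logits[state]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def select_action(state):
    """Actor: sample an action from the learned policy."""
    return rng.choice(N_ACTIONS, p=policy(state))

def evaluate(state):
    """Critic: estimate how good a state is."""
    return critic_values[state]
```

In deep variants the two tables are replaced by neural networks (often sharing early layers), but the division of labor is the same: the actor proposes, the critic appraises.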
The actor operates by mapping states to actions — either deterministically or through a probability distribution — and updates its policy based on feedback from the critic. The critic estimates how good a given state or state-action pair is, typically using temporal-difference learning to compute quantities like the advantage function, which measures whether an action performed better or worse than expected. This advantage signal then weights the actor's policy-gradient update, nudging the policy toward actions that yield higher long-term rewards while reducing the variance inherent in raw return estimates.
Actor-critic methods became especially powerful with the integration of deep neural networks, giving rise to influential algorithms such as Asynchronous Advantage Actor-Critic (A3C), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). These deep variants enabled breakthroughs in domains requiring complex, continuous control — including robotic manipulation, simulated locomotion, and game-playing agents — where earlier tabular or shallow methods fell short. PPO in particular became a widely adopted baseline due to its simplicity and robust empirical performance.
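PPO's robustness comes largely from its clipped surrogate objective, L = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)], where r_t is the probability ratio between the new and old policies and A_t is the critic-derived advantage. A sketch of that objective (the 0.2 clipping range is a commonly used default, not a requirement):

```python
import numpy as np

EPS_CLIP = 0.2  # typical clipping range; an assumption, not mandated by PPO

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages):
    """PPO's clipped surrogate objective (to be maximized).

    ratio r_t = pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to move the policy far from the one that collected the data.
    """
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - EPS_CLIP, 1.0 + EPS_CLIP) * advantages
    return np.minimum(unclipped, clipped).mean()

# A ratio of 2.0 with advantage 1.0 is clipped: min(2.0, 1.2) = 1.2.
obj = ppo_clipped_objective(
    new_log_probs=np.log(np.array([0.8])),
    old_log_probs=np.log(np.array([0.4])),
    advantages=np.array([1.0]),
)
```

Taking the minimum of the clipped and unclipped terms caps how much a single batch of data can push the policy, which is why PPO tends to train stably without the second-order machinery of earlier trust-region methods.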
The practical importance of actor-critic architectures extends across modern reinforcement learning research and application. They underpin many state-of-the-art systems in autonomous driving, real-time strategy games, and fine-tuning large language models via reinforcement learning from human feedback (RLHF). Their ability to handle continuous action spaces, stabilize training through critic-guided updates, and scale with deep learning makes them one of the most versatile and widely used frameworks in contemporary machine learning.