A reinforcement learning method that updates value estimates using differences between successive predictions.
Temporal difference (TD) learning is a foundational reinforcement learning technique that updates value estimates incrementally by comparing predictions made at successive time steps. Rather than waiting until the end of an episode to compute an error against the true outcome—as Monte Carlo methods do—TD learning bootstraps: it uses the immediate reward plus its own prediction at the next step to refine its prediction at the current step. The core update rule adjusts a value estimate toward a target that combines the observed reward with a discounted estimate of the next state's value; the gap between this target and the current estimate is called the TD error. This makes TD learning both online and incremental, enabling agents to learn continuously from each transition without requiring complete episodes or a model of the environment.
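As a minimal sketch of this update, assume a tabular setting with a small state space, a step size alpha, and a discount factor gamma (the function and variable names below are illustrative, not a reference implementation):

```python
import numpy as np

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V (illustrative sketch).

    Target:   reward + gamma * V[next_state]   (bootstrapped estimate)
    TD error: target - V[state]
    """
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]   # gap between the target and the current estimate
    V[state] += alpha * td_error      # move the estimate a small step toward the target
    return td_error

# Usage: a tiny 5-state value table, updated from a single observed transition
V = np.zeros(5)
delta = td0_update(V, state=2, reward=1.0, next_state=3)
```

Because the update needs only one observed transition, it can be applied after every step of interaction, which is what makes the method online and incremental.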
TD learning occupies a unique middle ground between dynamic programming and Monte Carlo methods. Like dynamic programming, it uses estimates of future values to update current estimates—a process called bootstrapping. Like Monte Carlo, it learns directly from raw experience without needing a transition model. This combination makes TD methods especially powerful in large or unknown environments where full planning is intractable. The simplest form, TD(0), updates based on a single step ahead, while TD(λ) generalizes this by blending multi-step returns using eligibility traces, allowing the algorithm to interpolate smoothly between one-step TD and full Monte Carlo returns.
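The blending performed by TD(λ) can be illustrated with accumulating eligibility traces. The sketch below assumes the same tabular setting as above; `lam` stands for the trace-decay parameter λ, with λ = 0 recovering one-step TD and λ → 1 approaching the Monte Carlo return (the episode format and names are assumptions made for illustration):

```python
import numpy as np

def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.99, lam=0.8):
    """Run TD(lambda) with accumulating eligibility traces over one episode.

    `transitions` is a list of (state, reward, next_state, done) tuples.
    Each TD error is credited to all recently visited states in proportion
    to their eligibility, which decays by gamma * lam at every step.
    """
    traces = np.zeros_like(V)
    for state, reward, next_state, done in transitions:
        # Bootstrap only if the next state is non-terminal.
        target = reward + (0.0 if done else gamma * V[next_state])
        td_error = target - V[state]
        traces[state] += 1.0              # accumulating trace for the visited state
        V += alpha * td_error * traces    # update every eligible state at once
        traces *= gamma * lam             # decay eligibilities toward zero
    return V
```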
TD learning is the conceptual backbone of many influential reinforcement learning algorithms. Q-learning and SARSA are both TD methods that extend the framework to learn action-value functions, enabling agents to derive explicit policies. Deep Q-Networks (DQN), which achieved superhuman performance on Atari games, apply TD learning with neural network function approximators. More recently, TD-style updates underpin actor-critic architectures and proximal policy optimization methods used in state-of-the-art systems.
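To show how the same TD error reappears when learning action values, the following sketch contrasts the Q-learning and SARSA targets for a tabular Q-table (the setup and names are assumed for illustration):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD target: bootstrap on the best available next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD target: bootstrap on the action actually taken next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# Usage: a toy Q-table with 5 states and 2 actions
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```

The only difference between the two updates is the bootstrapped term in the target, which is what separates off-policy Q-learning from on-policy SARSA; DQN applies the Q-learning target with a neural network in place of the table.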
The practical significance of TD learning lies in its sample efficiency and real-time applicability. Because updates occur after every transition, agents can begin improving their policies immediately, making TD methods well-suited to robotics, game playing, recommendation systems, and any sequential decision-making domain where waiting for episode completion is costly or impossible. Richard Sutton's 1988 formalization of the approach remains one of the most cited works in reinforcement learning.