A parameter that reduces the weight of future rewards relative to immediate ones.
In reinforcement learning (RL), the discount factor, typically denoted γ (gamma), controls how much an agent values future rewards compared to immediate ones. It is a scalar between 0 and 1: a reward that arrives k time steps in the future is weighted by γ^k. A value near 0 makes the agent myopic, prioritizing short-term gains, while a value near 1 encourages the agent to plan far into the future. Beyond shaping agent behavior, the discount factor serves a mathematical purpose: provided γ < 1 and rewards are bounded, the cumulative sum of discounted rewards (the return) is guaranteed to converge to a finite value even in infinite-horizon tasks, making optimization tractable.
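To make the weighting concrete, the return from step t is G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ⋯, so a reward delayed by k steps contributes γ^k times its face value. A minimal sketch in Python (the reward sequence and γ values below are arbitrary, chosen only for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# The same reward sequence valued under a myopic and a far-sighted gamma.
rewards = [0.0, 0.0, 0.0, 1.0]           # a single reward, three steps away
print(discounted_return(rewards, 0.1))   # ~0.001 -- nearly invisible to the agent
print(discounted_return(rewards, 0.99))  # ~0.970 -- almost full value
```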
The discount factor appears directly in the Bellman equation, the recursive relationship at the heart of most RL algorithms. When computing the value of a state, the agent adds the immediate reward to γ times the estimated value of the next state. This structure propagates value estimates backward through time, allowing agents to learn the long-range consequences of their actions through methods such as Q-learning, SARSA, and policy gradient algorithms. The choice of γ is a critical hyperparameter: set it too low and the agent fails to account for delayed consequences; set it too high and training can become unstable or slow to converge.
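As a concrete instance, the tabular Q-learning update applies this backup directly: the target for Q(s, a) is the immediate reward plus γ times the value of the best action available in the next state. A minimal sketch, with the table sizes, learning rate α, and integer state/action encodings assumed purely for illustration:

```python
import numpy as np

n_states, n_actions = 10, 4      # sizes chosen only for illustration
alpha, gamma = 0.1, 0.99         # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, done):
    """One Bellman backup: immediate reward plus gamma-discounted
    value of the best next action (zero if the episode ended)."""
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    target = r + bootstrap                  # gamma enters here
    Q[s, a] += alpha * (target - Q[s, a])   # move estimate toward the target
```

On terminal transitions, setting done zeroes the bootstrap term, which is the finite-horizon special case of the same recursion.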
Choosing an appropriate discount factor involves real trade-offs. In environments with dense, frequent rewards, lower values of γ often suffice and can speed up learning. In sparse-reward settings, where meaningful feedback may arrive only after many steps, higher values are typically necessary for the agent to connect actions to their eventual outcomes. Some modern approaches use adaptive or learned discount factors, or frame the problem in terms of average reward rather than discounted return, sidestepping the choice of γ entirely.
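A widely used rule of thumb when weighing these trade-offs is that γ implies an effective planning horizon of roughly 1/(1 − γ) steps, beyond which rewards are heavily attenuated. A quick illustration of the heuristic (a back-of-the-envelope calculation, not any library's API):

```python
# Effective horizon heuristic: rewards more than ~1/(1 - gamma) steps
# away contribute little to the discounted return.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    print(f"gamma={gamma}: effective horizon ~ {horizon:.0f} steps")
# gamma=0.9 -> ~10 steps, gamma=0.99 -> ~100, gamma=0.999 -> ~1000
```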
The concept has deep roots in economics, where discounting future cash flows to present value is foundational to investment analysis. Its adoption into the RL formalism, particularly through Markov Decision Processes and dynamic programming, gave the field a principled way to handle temporally extended decision-making. Today, the discount factor remains one of the most fundamental components of virtually all RL frameworks.