The number of future time steps an agent considers when making decisions.
In reinforcement learning and sequential decision-making, the horizon defines how far into the future an agent looks when evaluating the consequences of its actions. A finite horizon sets a fixed number of future steps the agent considers, while an infinite horizon allows the agent to account for rewards extending indefinitely into the future. This distinction fundamentally shapes how value functions are computed and what kinds of policies emerge from training.
The horizon length directly influences the mathematical formulation of the learning problem. In finite-horizon settings, value functions are time-dependent and are computed by backward induction, starting at the final time step and working back toward the present. In infinite-horizon settings, a discount factor γ (gamma) between 0 and 1 is typically introduced to ensure the sum of future rewards remains finite and to implicitly encode how much the agent prioritizes near-term versus long-term outcomes. A discount factor close to 1 approximates a long horizon, while a value close to 0 produces effectively short-horizon behavior focused on immediate rewards.
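As a minimal sketch of the discount factor's effect (the reward sequence and numbers here are purely illustrative, not from the text above), the following snippet computes a discounted return for a sparse, delayed reward and shows how γ near 1 preserves the value of a distant payoff while γ near 0 makes it effectively invisible:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence (illustrative sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# A sparse, delayed reward: nothing for 9 steps, then a payoff of 1.0 at step 9.
delayed = [0.0] * 9 + [1.0]

print(discounted_return(delayed, gamma=0.99))  # ~0.914: the distant payoff still carries weight
print(discounted_return(delayed, gamma=0.10))  # 1e-9: the payoff is essentially ignored (myopic behavior)
```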
The choice of horizon has profound practical consequences. Long horizons allow agents to discover strategies that sacrifice short-term gains for greater long-term payoffs — essential in games like chess or Go, or in robotic tasks with delayed success signals. However, long horizons increase the difficulty of credit assignment, making it harder to identify which early actions led to eventual outcomes. Short horizons simplify learning and reduce computational cost but can cause agents to behave myopically, missing opportunities that require planning ahead. Many real-world applications, such as financial portfolio management or autonomous navigation, require careful tuning of the effective horizon to balance these trade-offs.
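A common heuristic for this tuning, not stated in the passage above but standard in the RL literature, is that a discount factor γ implies an effective horizon of roughly 1/(1 − γ) steps, since rewards beyond that range are discounted to a small fraction of their original value:

```python
def effective_horizon(gamma):
    """Rough effective planning horizon implied by discount factor gamma (heuristic 1 / (1 - gamma))."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.9, 0.99, 0.999):
    print(f"gamma = {gamma}: effective horizon ~ {effective_horizon(gamma):.0f} steps")
# gamma = 0.9  -> ~10 steps
# gamma = 0.99 -> ~100 steps
# gamma = 0.999 -> ~1000 steps
```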
Horizon also interacts with exploration strategies, sample efficiency, and algorithm stability. Techniques like n-step returns and eligibility traces were developed in part to interpolate between short and long horizons during training. More recently, transformer-based architectures in offline RL have demonstrated that explicitly conditioning on long context windows can dramatically improve performance on tasks requiring extended temporal reasoning, renewing interest in how horizon length is represented and managed within modern deep RL systems.
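To make the interpolation idea concrete, the sketch below computes an n-step return, a hypothetical standalone implementation rather than any particular library's API. With n = 1 the estimate relies almost entirely on a bootstrapped value (short effective horizon); with n equal to the episode length it reduces to the full Monte Carlo return (long horizon):

```python
def n_step_return(rewards, values, t, n, gamma):
    """
    n-step return G_t^(n) = r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1}
                            + gamma^n * V(s_{t+n}),
    truncated at the end of the episode. `values[k]` is a value estimate for the
    state reached at step k. Illustrative sketch; names are not from the source.
    """
    episode_len = len(rewards)
    end = min(t + n, episode_len)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < episode_len:  # bootstrap from a value estimate only if the episode continues
        g += gamma ** (end - t) * values[end]
    return g

rewards = [0.0, 0.0, 0.0, 1.0]   # toy episode with a single delayed reward
values  = [0.5, 0.5, 0.5, 0.5]   # placeholder value estimates
print(n_step_return(rewards, values, t=0, n=1, gamma=0.99))  # ~0.495: mostly bootstrapped, short horizon
print(n_step_return(rewards, values, t=0, n=4, gamma=0.99))  # ~0.970: full return, sees the delayed reward
```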