
Envisioning is an emerging technology research institute and advisory.


Bellman Equation

Recursive formula for computing optimal value functions in sequential decision-making.

Year: 1988 · Generality: 838

The Bellman equation is a recursive mathematical relationship that decomposes the problem of finding an optimal policy into a sequence of simpler subproblems. At its core, it expresses the value of being in a given state as the sum of the immediate reward obtained by taking an action and the discounted value of the resulting next state. This decomposition reflects the principle of optimality: an optimal policy must produce optimal decisions at every subsequent step, regardless of how the current state was reached. In reinforcement learning, two key variants appear — the Bellman expectation equation, which evaluates the value of following a fixed policy, and the Bellman optimality equation, which directly characterizes the value achievable under the best possible policy.
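In standard MDP notation (the symbols below follow common textbook conventions and are not defined in this entry: π is a policy, R the reward function, P the transition probabilities, and γ the discount factor), the two variants can be written as:

```latex
% Bellman expectation equation: value of following a fixed policy \pi
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s') \Big]

% Bellman optimality equation: value under the best possible policy
V^{*}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big]
```

The expectation equation averages over the policy's action choices, while the optimality equation replaces that average with a max, which is what makes it a direct characterization of optimal behavior.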

In practice, the Bellman equation underpins most classical and modern reinforcement learning algorithms. Temporal difference methods such as Q-learning and SARSA use it to iteratively update value estimates by comparing predicted rewards against observed outcomes. Deep reinforcement learning systems like DQN extend this by approximating the value function with neural networks, using the Bellman equation to construct training targets. The discount factor γ within the equation controls how much the agent prioritizes immediate versus future rewards, and tuning it is often critical to learning stable, effective behavior.
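The Q-learning update described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the toy state/action sizes, hyperparameters, and the single transition shown are illustrative assumptions, not taken from this entry.

```python
import numpy as np

# Tabular Q-learning on a toy MDP with 2 states and 2 actions.
# The environment itself is assumed for illustration.
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9  # learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    # Bellman-style TD target: immediate reward plus the discounted
    # value of the best action available in the next state.
    target = r + gamma * Q[s_next].max()
    # Move the current estimate toward the target by a step of size alpha.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# One illustrative transition: from state 0, action 1 yields reward 1.0
# and lands in state 1.
Q = q_update(Q, 0, 1, 1.0, 1)
print(Q[0, 1])  # prints 0.5: half a step from 0 toward the target of 1.0
```

The difference `target - Q[s, a]` is the temporal-difference error the article refers to: the gap between the predicted value and the Bellman-backed-up observation.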

The equation's power lies in enabling tractable solutions to problems that would otherwise require exhaustive search over all possible action sequences. By expressing long-horizon optimization recursively, it allows algorithms to propagate reward information backward through time, gradually refining value estimates across an agent's state space. This makes it applicable to environments ranging from simple grid worlds to complex continuous-control tasks.
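The backward propagation of reward information can be seen concretely in value iteration on a tiny chain-world. The MDP below (a four-state chain with a single rewarding goal state and deterministic moves) is an assumed toy example for illustration only.

```python
import numpy as np

# Value iteration on a deterministic 4-state chain: entering the final
# state (index 3) yields reward 1; all other transitions yield 0.
gamma = 0.9
n = 4
V = np.zeros(n)

for _ in range(100):
    V_new = V.copy()
    for s in range(n - 1):  # state 3 is terminal, value stays 0
        reward = 1.0 if s + 1 == n - 1 else 0.0
        # Bellman backup: max over the two available actions.
        V_new[s] = max(reward + gamma * V[s + 1],  # move right
                       0.0 + gamma * V[s])         # stay put
    if np.allclose(V_new, V):
        break
    V = V_new

# Each sweep pushes value one state further from the goal; the fixed
# point decays geometrically with distance: V = [0.81, 0.9, 1.0, 0.0].
print(V)
```

Notice that no action sequences are enumerated: repeated local backups converge to the same values exhaustive search would produce, which is exactly the tractability argument made above.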

Originally formulated by Richard Bellman in the 1950s for dynamic programming in operations research and control theory, the equation became central to machine learning as reinforcement learning matured in the 1980s and 1990s. Today it remains one of the most foundational concepts in the field, connecting theoretical guarantees about optimal decision-making to practical algorithms deployed in robotics, game playing, and autonomous systems.

Related

Value Function
A function estimating expected cumulative reward from a given state or action.
Generality: 842

Q-Learning
A model-free reinforcement learning algorithm that learns optimal action values through experience.
Generality: 792

Q-Value
Expected cumulative reward for taking an action in a given state under a policy.
Generality: 756

DP (Dynamic Programming)
An optimization technique that solves complex problems by caching solutions to overlapping subproblems.
Generality: 838

Temporal Difference Learning
A reinforcement learning method that updates value estimates using differences between successive predictions.
Generality: 780

RL (Reinforcement Learning)
A learning paradigm where an agent maximizes cumulative rewards through environmental interaction.
Generality: 908