A function estimating expected cumulative reward from a given state or action.
In reinforcement learning (RL), a value function estimates the expected cumulative future reward — the expected return — an agent can obtain from a given situation. There are two primary variants: the state-value function V(s), which estimates the expected return when starting from state s and following a particular policy, and the action-value function Q(s, a) — often called the Q-function — which estimates the expected return when taking action a in state s and following the policy thereafter. Together, these functions give an agent a way to evaluate how "good" any given situation or decision is in the long run, not just immediately.
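The quantity both V(s) and Q(s, a) estimate in expectation is the discounted return G = r_0 + γr_1 + γ²r_2 + …. A minimal sketch (function and variable names here are illustrative, not from any RL library):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per time step.

    Computed backward: G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# An episode with rewards [1, 0, 2] and gamma = 0.9:
# G = 1 + 0.9 * 0 + 0.81 * 2 = 2.62
print(discounted_return([1, 0, 2]))
```

A value function is then the expectation of this quantity over the trajectories the policy induces from a given state (or state-action pair).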
Value functions are computed with respect to a policy — a strategy that dictates how the agent behaves. The Bellman equations, a set of recursive relationships, form the mathematical backbone of value function estimation. They express the value of a state as the expected immediate reward plus the discounted value of the successor state — compactly, V(s) = E[r + γV(s′)] — allowing values to be propagated backward through time. Algorithms like dynamic programming, temporal difference (TD) learning, and Monte Carlo methods all leverage these equations to iteratively refine value estimates from experience or simulation.
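TD(0) learning turns the Bellman recursion directly into a sample-based update: V(s) ← V(s) + α(r + γV(s′) − V(s)). A toy sketch on an assumed three-state chain MDP (states 0 → 1 → 2, with 2 terminal and a reward of +1 on entering it); all names are illustrative:

```python
def td0_chain(episodes=2000, alpha=0.1, gamma=0.9):
    """Estimate state values on a deterministic 3-state chain via TD(0)."""
    V = [0.0, 0.0, 0.0]  # V[2] stays 0.0: state 2 is terminal
    for _ in range(episodes):
        s = 0
        while s != 2:
            s_next = s + 1
            r = 1.0 if s_next == 2 else 0.0  # reward only on reaching the goal
            # Bellman-backed update: move V[s] toward r + gamma * V[s_next]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

After enough episodes the estimates approach the true values: V[1] ≈ 1.0 (one step from the reward) and V[0] ≈ γ · 1.0 = 0.9, illustrating how value propagates backward from the reward through the Bellman relationship.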
The practical importance of value functions in modern ML is enormous. Deep Q-Networks (DQN), introduced by DeepMind, approximated the Q-function with a deep neural network, enabling agents to master Atari games directly from raw pixels. Actor-critic architectures — foundational to state-of-the-art algorithms like PPO and SAC — use a learned value function (the "critic") to reduce variance in policy gradient estimates, dramatically improving training stability and sample efficiency. Value functions also underpin many approaches to reward shaping, exploration strategies, and safe RL.
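DQN's training loss is the deep-network analogue of the classic tabular Q-learning update, and the actor-critic "critic" supplies a baseline via the TD error. A hedged sketch of both updates, with illustrative names (not any particular library's API):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning: nudge Q[s][a] toward the off-policy target.

    DQN replaces the table Q with a neural network and minimizes the
    squared difference to this same target.
    """
    target = r + gamma * max(Q[s_next])  # bootstrap on the best next action
    Q[s][a] += alpha * (target - Q[s][a])


def td_advantage(r, v_s, v_s_next, gamma=0.99, terminal=False):
    """One-step advantage estimate used by actor-critic methods.

    The critic's TD error, r + gamma*V(s') - V(s), scales the policy
    gradient in place of the raw return, reducing its variance.
    """
    bootstrap = 0.0 if terminal else gamma * v_s_next
    return r + bootstrap - v_s
```

The variance reduction comes from subtracting the state-value baseline V(s): the policy is pushed toward actions that did better than expected, rather than toward any action followed by high absolute reward.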
Beyond game-playing, value functions are central to real-world RL applications including robotics, recommendation systems, and fine-tuning large language models via reinforcement learning from human feedback (RLHF). Accurately estimating value — especially in high-dimensional, continuous spaces — remains one of the core challenges of the field, driving ongoing research into better function approximators, off-policy correction methods, and uncertainty-aware value estimation.