Expected cumulative reward for taking an action in a given state under a policy.
In reinforcement learning, a Q-value (also called an action-value) quantifies the expected cumulative, typically discounted, reward an agent will accumulate by taking a specific action in a specific state and then following a given policy thereafter. Unlike a state value, which rates how good a state is regardless of which action is taken next, a Q-value evaluates a state and an action together, giving the agent a direct basis for choosing between alternatives. The higher the Q-value of a state-action pair, the more beneficial that action is expected to be in the long run.
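Formally, for a policy π with discount factor γ (standard notation, written out here for concreteness), the Q-value is the expected discounted return from taking action a in state s and following π thereafter:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_{0} = s,\; a_{0} = a \,\right]
```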
Q-values are learned iteratively using the Bellman optimality equation, which expresses a recursive relationship: the Q-value of a state-action pair equals the immediate reward received plus a discounted estimate of the best Q-value achievable from the resulting next state. In tabular Q-learning, these values are stored in a lookup table and updated after each transition. In environments with large or continuous state spaces, a neural network approximates the Q-function instead; this technique is central to Deep Q-Networks (DQN), which demonstrated human-level performance on Atari games and brought deep reinforcement learning into the mainstream.
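As a concrete sketch, a minimal tabular Q-learning update along these lines might look as follows. The state and action counts, learning rate alpha, and discount gamma are illustrative assumptions, not values from any particular task:

```python
import numpy as np

# Illustrative sizes and hyperparameters for a small discrete environment.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))  # the lookup table of Q-values

def q_learning_update(s, a, r, s_next, done):
    """Move Q[s, a] toward the Bellman target: the immediate reward
    plus the discounted best Q-value achievable from the next state."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Each call nudges the stored estimate toward the sampled target, so repeated experience gradually propagates reward information backward through the state space.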
The practical importance of Q-values lies in how they enable decision-making. An agent following a greedy policy simply selects the action with the highest Q-value at each step. During training, exploration strategies like epsilon-greedy balance exploiting known high-Q actions with exploring potentially better ones. Accurate Q-value estimates are therefore essential: overestimation or instability in Q-values can cause learning to diverge, motivating techniques like target networks, experience replay, and Double DQN.
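A minimal epsilon-greedy selector consistent with this description might look like the sketch below; the function name and the seeded generator are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a uniformly random action (explore);
    otherwise take the action with the highest Q-value (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[s]))               # exploit
```

In practice epsilon is often annealed from a high value toward a small one, so the agent explores broadly early in training and exploits its learned estimates later.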
Q-values remain a cornerstone of model-free reinforcement learning. They appear in a wide range of algorithms beyond basic Q-learning, including SARSA, Dueling DQN, and actor-critic methods that use Q-value estimates as a critic signal. Understanding Q-values is foundational to understanding how agents learn to act optimally through trial and error in complex, reward-driven environments.
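For contrast with the Q-learning update sketched earlier, SARSA's on-policy update bootstraps from the action the agent actually takes next rather than from the greedy maximum. A hedged sketch, with illustrative hyperparameters:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA (on-policy): bootstrap from Q[s_next, a_next], the value of
    the action actually taken next, instead of max over next actions."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```

That single change in the bootstrap target is what makes SARSA on-policy and Q-learning off-policy.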