A model-free reinforcement learning algorithm that learns optimal action values through experience.
Q-learning is a model-free reinforcement learning algorithm that enables an agent to learn how to act optimally in an environment by estimating the value of taking specific actions in specific states. At its core, Q-learning maintains a Q-function — often represented as a table or, in modern implementations, a neural network — that maps state-action pairs to expected cumulative rewards. The agent iteratively updates these Q-values with a temporal-difference rule derived from the Bellman optimality equation: after taking an action and observing the resulting reward and next state, it adjusts its estimate to bring it closer to the bootstrapped long-term value. Over many interactions, this process converges toward the optimal action values, and hence an optimal policy, without ever requiring a model of the environment's transition dynamics.
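Concretely, the update fits in a few lines. The sketch below shows tabular Q-learning against a Gymnasium-style environment with discrete states and actions; the environment interface, hyperparameters, and episode loop are illustrative assumptions rather than part of any canonical implementation.

```python
import numpy as np

def q_learning(env, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn a Q-table by direct interaction (sketch; hyperparameters are illustrative)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state, _ = env.reset()  # assumes a Gymnasium-style (obs, info) return
        done = False
        while not done:
            # Epsilon-greedy behavior policy: occasionally explore at random.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Temporal-difference update toward the Bellman optimality target:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = reward + (0.0 if terminated else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

The difference between the target and the current estimate is the temporal-difference error; the learning rate alpha controls how far each observed transition moves the estimate.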
A defining characteristic of Q-learning is that it is an off-policy algorithm, meaning the agent can learn about the optimal policy while following a different, often more exploratory, behavior policy. This is typically managed through an epsilon-greedy strategy, where the agent occasionally takes random actions to explore the environment rather than always exploiting its current best estimates. This separation between the target policy being learned and the behavior policy generating data makes Q-learning particularly flexible and sample-efficient in many settings, allowing it to incorporate experience from diverse sources, including replay buffers of past interactions.
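A minimal sketch of epsilon-greedy selection, paired with one common annealing schedule; the decay form and constants here are assumptions for illustration, not a standard:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest current Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_rate=1e-4):
    """Exponentially anneal epsilon toward a floor, so early steps
    explore broadly and later steps mostly exploit."""
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * step)

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), decayed_epsilon(10_000), rng)
```

Note that even though actions are drawn from this exploratory policy, the update target always takes the greedy max over next-state values, so the learned Q-function tracks the greedy target policy; that gap between the two policies is exactly what "off-policy" means.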
Q-learning gained renewed prominence with the introduction of Deep Q-Networks (DQN) by DeepMind, first presented in 2013 and published in Nature in 2015, which replaced the traditional lookup table with a deep convolutional neural network to approximate the Q-function. This allowed the algorithm to scale to high-dimensional state spaces — most famously, learning to play Atari video games directly from raw pixel inputs at human or superhuman levels on many of the games. Techniques such as experience replay and target networks were introduced to stabilize training, addressing the instability that arises when neural networks are used as function approximators in this setting.
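The sketch below illustrates both stabilization tricks in PyTorch: a replay buffer that decorrelates consecutive transitions, and a periodically synced target network that keeps the regression target fixed between updates. The network size, buffer capacity, and transition format (state, action, reward, next_state, done) are assumptions for the sketch; this is a schematic of the idea, not DeepMind's implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

buffer = deque(maxlen=100_000)  # experience replay memory of transition tuples

def make_net(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

online_net = make_net(4, 2)  # dimensions sized for a small control task
target_net = make_net(4, 2)
target_net.load_state_dict(online_net.state_dict())
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=64):
    if len(buffer) < batch_size:
        return
    # Uniform random sampling breaks the temporal correlation of experience.
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()
    # Q(s, a) under the online network, for the actions actually taken.
    q = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # The bootstrapped target comes from the frozen target network.
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values
        target = r.float() + gamma * q_next * (1.0 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # In practice, called every few thousand environment steps.
    target_net.load_state_dict(online_net.state_dict())
```

Random mini-batch sampling prevents the network from overfitting to the most recent trajectory, while the infrequent target sync stops the bootstrapped target from chasing its own updates, the two main sources of instability the DQN work addressed.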
Today, Q-learning and its deep variants remain foundational to reinforcement learning research and practice. The algorithm underpins a wide range of applications, from robotic control and autonomous navigation to game playing and resource scheduling. Its simplicity, its proof of convergence in the tabular setting (given sufficient exploration and appropriately decaying learning rates), and its adaptability to complex function approximators make it one of the most studied and widely applied algorithms in the field.