A learning paradigm in which an agent learns to maximize cumulative reward by interacting with an environment.
Reinforcement learning (RL) is a branch of machine learning in which an autonomous agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning, which relies on labeled input-output pairs, RL agents are not told what the correct action is — they must discover effective strategies through trial and error. The agent observes the current state of the environment, selects an action according to a policy, receives a reward signal, and transitions to a new state. The goal is to learn a policy that maximizes cumulative reward over time, a quantity often formalized as the expected discounted return.
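The loop described above can be made concrete in a few lines of code. The sketch below is purely illustrative: `ToyEnv` is a hypothetical environment (not a real library API) in which the agent walks along a number line toward a goal state, and the policy is deliberately trivial. It shows the state-action-reward-next-state cycle and how the discounted return accumulates.

```python
import random

# Hypothetical toy environment for illustration: the agent starts at 0
# and tries to reach `goal` by stepping -1 or +1.
# reset() returns the initial state; step(action) returns
# (next_state, reward, done), mirroring the loop described above.
class ToyEnv:
    def __init__(self, goal=5):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action              # action is -1 or +1
        done = self.state == self.goal
        reward = 1.0 if done else -0.1    # small step cost, bonus at goal
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
gamma, discounted_return, t = 0.99, 0.0, 0  # gamma is the discount factor

done = False
while not done and t < 100:
    action = random.choice([-1, 1])       # a (poor) random policy
    state, reward, done = env.step(action)
    discounted_return += (gamma ** t) * reward  # G = sum_t gamma^t * r_t
    t += 1

print(f"discounted return: {discounted_return:.3f}")
```

A learning algorithm replaces the random policy with one that improves from experience, which is exactly what the algorithm families below do.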
The mathematical backbone of RL is the Markov Decision Process (MDP), which provides a principled framework for modeling sequential decision-making under uncertainty. Core algorithms fall into several families: value-based methods (such as Q-learning) estimate the long-term value of state-action pairs; policy gradient methods directly optimize the policy parameters; and actor-critic architectures combine both approaches. The advent of deep reinforcement learning — pairing deep neural networks with RL algorithms — dramatically expanded the complexity of problems RL could tackle, enabling agents to learn directly from high-dimensional inputs like raw pixels.
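To ground the value-based family, here is a minimal tabular Q-learning sketch. It reuses the hypothetical `ToyEnv` from the sketch above, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are illustrative choices rather than recommended settings.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch, assuming the ToyEnv class defined in the
# previous example (any env with the same reset()/step() interface works).
ACTIONS = [-1, 1]
Q = defaultdict(float)                # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise exploit the current Q table.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

env = ToyEnv()
for episode in range(500):
    state, done, t = env.reset(), False, 0
    while not done and t < 100:
        action = epsilon_greedy(state)
        next_state, reward, done = env.step(action)
        # TD target: r + gamma * max_a' Q(s', a'), with 0 at terminal states.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        t += 1
```

Policy gradient methods would instead parameterize the policy directly and adjust its parameters along the gradient of expected return; actor-critic methods learn both a policy and a value estimate like the one above.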
RL has produced some of the most striking demonstrations of machine intelligence. DeepMind's DQN, introduced in 2013, learned to play Atari games from raw screen input; AlphaGo defeated Go world champion Lee Sedol in 2016; and OpenAI Five competed at a professional level in Dota 2. More recently, RL has become central to aligning large language models with human preferences through techniques like Reinforcement Learning from Human Feedback (RLHF). These successes have cemented RL as a critical tool wherever sequential decision-making, planning, or adaptive control is required.
Despite its power, RL remains challenging in practice. Agents often require enormous amounts of experience to learn effective policies, reward functions can be difficult to specify correctly, and learned behaviors may not generalize well beyond training conditions. Active research areas include sample efficiency, safe exploration, multi-agent RL, and offline RL — where agents learn from fixed datasets rather than live interaction — all aimed at making RL more practical for real-world deployment.