An optimization technique that solves complex problems by caching solutions to overlapping subproblems.
Dynamic programming (DP) is an algorithmic technique for solving optimization and decision problems by decomposing them into smaller, overlapping subproblems, solving each subproblem exactly once, and storing the results in a lookup table or memoization cache. Rather than recomputing the same intermediate values repeatedly — as naive recursion would — DP reuses stored solutions, transforming many exponential-time algorithms into polynomial-time ones. The approach applies whenever a problem exhibits two key properties: optimal substructure (an optimal solution can be built from optimal solutions to subproblems) and overlapping subproblems (the same subproblems recur across different branches of computation).
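The overlapping-subproblems property is easy to see by instrumenting a naive recursion. The sketch below (illustrative code; the `calls` counter is just bookkeeping added for the demonstration) counts how often each Fibonacci subproblem is revisited without caching:

```python
from collections import Counter

def fib_naive(n, calls):
    """Naive recursion: the same subproblems are solved over and over."""
    calls[n] += 1  # record every time this subproblem is entered
    if n < 2:
        return n
    return fib_naive(n - 1, calls) + fib_naive(n - 2, calls)

calls = Counter()
result = fib_naive(10, calls)
print(result)     # 55
print(calls[1])   # fib(1) alone is recomputed 55 times
```

Caching each subproblem's answer collapses this exponential blow-up to a linear number of distinct computations, which is exactly what DP exploits.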
In practice, DP can be implemented either top-down (recursive with memoization, where results are cached as they are first computed) or bottom-up (iterative tabulation, where smaller subproblems are solved first and results are stored in a table before tackling larger ones). Classic examples include computing Fibonacci numbers, shortest-path algorithms like Bellman-Ford, sequence alignment in bioinformatics, and the knapsack problem in combinatorial optimization. The technique is foundational to a wide range of algorithms across computer science and operations research.
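Both styles can be sketched on the Fibonacci example mentioned above; the function names here are illustrative:

```python
from functools import lru_cache

# Top-down: plain recursion plus a memoization cache.
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

# Bottom-up: solve the smallest subproblems first, filling a table.
def fib_tab(n):
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

assert fib_memo(40) == fib_tab(40) == 102334155
```

The two variants compute the same values; top-down only touches subproblems actually needed, while bottom-up avoids recursion depth limits and often has better constant factors.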
In machine learning and reinforcement learning specifically, dynamic programming plays a central role. RL algorithms such as policy iteration and value iteration rely directly on DP principles to compute optimal policies in Markov decision processes (MDPs). Bellman's optimality equations — which express the value of a state as a function of the values of successor states — are the mathematical backbone of modern RL. Even deep RL methods like Q-learning and temporal difference learning are conceptually grounded in DP, approximating the Bellman equations when the state space is too large for exact tabular solutions.
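A minimal value-iteration sketch makes the connection concrete. The 2-state MDP below is hypothetical (transition probabilities, rewards, and action names are invented for illustration); each sweep applies the Bellman optimality backup V(s) ← max_a Σ_s' P(s'|s,a)[R + γV(s')]:

```python
gamma = 0.9  # discount factor

# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

V = {s: 0.0 for s in P}
for _ in range(200):  # sweep Bellman backups until (approximately) converged
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

# Greedy policy extracted from the converged values.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)
```

In this toy example, state 1 collects reward 2 forever under "stay", so V(1) converges to 2/(1 − 0.9) = 20, and the extracted policy moves toward that state. Q-learning performs the same backup from sampled transitions instead of the known model.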
Beyond reinforcement learning, DP appears in training sequence models, in decoding algorithms for structured prediction (such as the Viterbi algorithm for hidden Markov models), and in structured attention variants of certain architectures whose inference step is itself a DP recurrence. Its ability to make otherwise intractable recursive computations feasible makes it one of the most broadly applicable algorithmic paradigms in both classical and modern machine learning.
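Viterbi decoding is a compact example of DP for structured prediction. The sketch below uses the classic toy weather HMM (states, observations, and all probabilities are illustrative); the table `best[t][s]` holds the probability of the most likely state path that ends in state s after the first t+1 observations:

```python
# Toy HMM with illustrative parameters.
states = ["Rainy", "Sunny"]
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    """Most likely hidden-state sequence for an observation sequence."""
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]  # backpointers for path reconstruction
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor state, reusing the stored column for t-1.
            prev, p = max(((s2, best[t - 1][s2] * trans[s2][s])
                           for s2 in states), key=lambda kv: kv[1])
            best[t][s] = p * emit[s][obs[t]]
            back[t][s] = prev
    # Trace the argmax path backwards through the backpointers.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

print(viterbi(["walk", "shop", "clean"]))
```

Each column of the table is computed once from the previous column, so decoding runs in O(T·|S|²) time instead of enumerating all |S|^T paths.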