A training objective where models learn to predict the next token in a sequence.
Next word prediction is a foundational task in natural language processing in which a model learns to estimate the probability of the next word (or token) given all preceding words in a sequence. Formally, this is the problem of modeling P(wₙ | w₁, w₂, ..., wₙ₋₁), and it serves as both a practical capability and a powerful self-supervised training signal. Because text corpora are abundantly available and require no manual labeling, next word prediction allows models to learn rich representations of language structure simply by training on raw text at massive scale.
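The conditional probability P(wₙ | w₁, ..., wₙ₋₁) can be made concrete with a deliberately tiny sketch: a bigram model that truncates the full history to just the previous word and estimates probabilities from raw co-occurrence counts. This is an illustrative toy, not how modern neural models work; the corpus and function names here are invented for the example.

```python
from collections import defaultdict

def train_bigram(corpus_tokens):
    """Estimate P(next | prev) from co-occurrence counts in a token list."""
    # Count how often each word follows each context word.
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize counts into conditional probability distributions.
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {w: c / total for w, c in nexts.items()}
    return probs

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)
print(model["the"])  # distribution over words that follow "the"
```

Note that no labels were needed: the raw text itself supplies every (context, next-word) training pair, which is exactly what makes this objective self-supervised.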
The mechanics have evolved considerably over time. Early n-gram language models estimated word probabilities from co-occurrence counts, but were limited by their fixed context windows and inability to generalize. Recurrent neural networks improved on this by maintaining a hidden state that could theoretically capture long-range dependencies, though they struggled in practice with vanishing gradients. The transformer architecture, introduced in 2017, addressed these limitations through self-attention mechanisms that directly relate every token in a sequence to every other, enabling far more effective modeling of long-range context. Modern large language models such as GPT are trained almost entirely on this objective, predicting the next token autoregressively across billions of examples.
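The autoregressive loop described above can be sketched independently of any particular architecture: the model assigns a score to every vocabulary token, the scores are converted to probabilities with a softmax, and the chosen token is appended to the context before the next prediction. In the sketch below, `toy_scores` is a hand-written stand-in for a trained network, and the three-word vocabulary is invented for illustration.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(score_fn, vocab, prompt, steps):
    # Autoregressive loop: each predicted token is appended to the
    # context before predicting the following one.
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(score_fn(tokens))
        tokens.append(vocab[probs.index(max(probs))])  # greedy decoding
    return tokens

vocab = ["the", "cat", "sat"]

def toy_scores(context):
    # Stand-in for a trained model: favors the token after the last
    # one in the (cyclic) vocabulary. Purely illustrative.
    i = vocab.index(context[-1])
    return [3.0 if j == (i + 1) % len(vocab) else 0.0 for j in range(len(vocab))]

print(generate(toy_scores, vocab, ["the"], 3))
# -> ['the', 'cat', 'sat', 'the']
```

A real transformer differs only in where the scores come from: `score_fn` becomes a forward pass through self-attention layers over the whole context, and greedy selection is often replaced by sampling, but the loop itself is the same.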
The significance of next word prediction extends well beyond autocomplete features on smartphones or search engines. Researchers discovered that models trained at sufficient scale on this simple objective develop emergent capabilities, including reasoning, translation, and code generation, for which they were never explicitly trained. This insight reframed next word prediction from a narrow NLP task into a general-purpose pretraining strategy. Applications now span conversational AI, document summarization, creative writing assistance, and software development tools, making next word prediction one of the most consequential training paradigms in modern machine learning.