A training objective where language models learn to predict the next token in a sequence.
Next Token Prediction (NTP) is the foundational training objective underlying most modern autoregressive language models. Given a sequence of tokens — discrete units representing words, subwords, or characters — the model is trained to estimate the probability distribution over all possible next tokens. By repeatedly applying this objective across enormous text corpora, the model learns grammar, factual associations, reasoning patterns, and stylistic conventions purely from the statistical structure of language. At inference time, the same mechanism drives text generation: the model samples or selects from its predicted distribution, appends the chosen token, and repeats the process to produce fluent, coherent output.
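The predict-append-repeat loop can be sketched with a toy stand-in for the model. Here a hypothetical bigram table (illustrative only — a real LM conditions on the full preceding context, not just the last token) supplies the next-token distribution, and greedy decoding selects the most probable token at each step:

```python
# Toy "model": a bigram table mapping the current token to a probability
# distribution over candidate next tokens. This is a hypothetical stand-in
# for a trained LM, used only to show the generation loop.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.3, "ran": 0.7},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def generate(max_tokens=10):
    """Greedy autoregressive decoding: predict, append, repeat."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAM[tokens[-1]]
        # Greedy selection: take the most probable next token.
        # (Sampling from dist instead would yield more varied output.)
        next_tok = max(dist, key=dist.get)
        tokens.append(next_tok)
        if next_tok == "</s>":  # stop at the end-of-sequence token
            break
    return tokens

print(generate())
```

In practice the argmax is often replaced by temperature sampling, top-k, or nucleus sampling over the same distribution; the loop structure is unchanged.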
The mechanics of NTP are tightly coupled to the transformer architecture. A causal (decoder-only) transformer processes the input sequence with masked self-attention, ensuring each token can only attend to preceding tokens. The final hidden state at each position is projected through a linear layer and a softmax activation to produce a probability distribution over the vocabulary. Training minimizes cross-entropy loss between predicted distributions and the actual next tokens — a simple objective that scales remarkably well with model size and data volume, a property formalized in neural scaling laws.
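The projection-softmax-cross-entropy chain described above can be written out in a few lines of NumPy. This is a minimal sketch of the loss computation only (the transformer layers that produce the logits are assumed); `logits` has shape `(seq_len, vocab_size)` and `targets` holds the integer id of the actual next token at each position:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ntp_cross_entropy(logits, targets):
    """Mean cross-entropy between the predicted next-token distributions
    and the actual next tokens (targets are integer token ids)."""
    probs = softmax(logits)                                 # (seq_len, vocab)
    picked = probs[np.arange(len(targets)), targets]        # prob of true token
    return -np.log(picked).mean()

# With uniform logits the model is maximally uncertain: the loss equals
# log(vocab_size), a common sanity check at the start of training.
logits = np.zeros((3, 4))
targets = np.array([1, 2, 0])
print(ntp_cross_entropy(logits, targets))  # log(4) ≈ 1.386
```

Deep-learning frameworks fuse the softmax and log into a single numerically stable op (e.g. PyTorch's `CrossEntropyLoss`), but the quantity minimized is exactly this.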
What makes NTP so powerful is that it is a self-supervised objective: no human-labeled data is required. Any raw text corpus serves as training signal, enabling models to train on trillions of tokens scraped from the web, books, and code. This abundance of supervision, combined with the expressive capacity of large transformers, produces models that generalize far beyond simple word completion — exhibiting emergent capabilities in translation, summarization, mathematical reasoning, and code generation.
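The self-supervision is visible in how training pairs are built: the target sequence is just the input shifted by one position, so any raw text labels itself. A minimal sketch, using character-level "tokenization" to stay self-contained (`make_training_pairs` and `context_len` are illustrative names, not a standard API):

```python
def make_training_pairs(token_ids, context_len):
    """Slice a token stream into (input, target) pairs: the target at each
    position is simply the next token in the stream — no human labels needed."""
    pairs = []
    for i in range(len(token_ids) - context_len):
        inputs = token_ids[i : i + context_len]
        targets = token_ids[i + 1 : i + context_len + 1]  # shifted by one
        pairs.append((inputs, targets))
    return pairs

# Character-level vocabulary keeps the example self-contained; real LLMs
# use subword tokenizers over a vocabulary of tens of thousands of entries.
text = "hello"
vocab = sorted(set(text))                  # ['e', 'h', 'l', 'o']
ids = [vocab.index(c) for c in text]       # [1, 0, 2, 2, 3]
pairs = make_training_pairs(ids, context_len=3)
print(pairs)  # [([1, 0, 2], [0, 2, 2]), ([0, 2, 2], [2, 2, 3])]
```

Each pair supplies `context_len` supervised predictions at once, since the model is trained to predict the next token at every position in parallel.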
NTP has become the dominant pre-training paradigm for large language models (LLMs) such as GPT-3, GPT-4, LLaMA, and Claude. While alternatives like masked language modeling (used in BERT) predict tokens at arbitrary positions, NTP's left-to-right generation makes it naturally suited for open-ended text synthesis. Fine-tuning techniques such as RLHF (Reinforcement Learning from Human Feedback) are typically layered on top of NTP-pretrained models to align behavior with human preferences, but the NTP foundation remains central to modern language AI.