Envisioning is an emerging technology research institute and advisory.

2011 — 2026


Next Token Prediction

A training objective where models learn to predict the next token in a sequence.

Year: 2018 · Generality: 794

Next token prediction is the core training objective behind most modern large language models (LLMs). Given a sequence of tokens — words, subwords, or characters — the model is trained to predict what token comes next. By repeating this process across billions of examples, the model learns grammar, facts, reasoning patterns, and stylistic conventions purely from the statistical structure of text. Despite its simplicity, this single objective is remarkably powerful: a model that can consistently predict the next token must, in some sense, understand the content it is processing.
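Concretely, every position in a training sequence yields one example: the tokens so far form the context, and the following token is the prediction target. A minimal sketch with a toy word-level tokenization (the sequence and split are illustrative, not any particular model's tokenizer):

```python
# Toy token sequence; real models use subword tokens and billions of examples.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Each position produces one (context, next-token) training pair.
examples = [
    (tokens[:i], tokens[i])
    for i in range(1, len(tokens))
]

for context, target in examples:
    print(context, "->", target)
# ['the'] -> 'cat', ['the', 'cat'] -> 'sat', and so on.
```

A single six-token sequence thus yields five supervised examples for free, which is why plain text is such an abundant training signal.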

The mechanics rely on computing a probability distribution over the entire vocabulary at each position in the sequence. During training, the model's predicted distribution is compared against the actual next token using cross-entropy loss, and gradients are backpropagated to update model weights. Modern transformer-based architectures process all token positions in parallel using causal (masked) self-attention, which prevents any position from attending to future tokens — preserving the integrity of the prediction task while enabling efficient training at scale.
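The loss computation and the causal mask can be sketched in a few lines of plain Python (a didactic stand-in for the batched tensor operations a real framework performs; the four-token vocabulary and logit values are made up):

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the actual next token."""
    return -math.log(softmax(logits)[target_index])

def causal_mask(n):
    """Position i may attend only to positions j <= i -- never to the future."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Hypothetical logits over a 4-token vocabulary at one sequence position.
logits = [2.0, 0.5, 0.1, -1.0]
loss = cross_entropy(logits, target_index=0)  # low loss: token 0 is favoured
```

Training minimizes this loss averaged over all positions; because the mask lets every position be scored in parallel, one forward pass supervises the whole sequence at once.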

Next token prediction serves a dual role: it is both a pre-training objective and, implicitly, a generative mechanism. At inference time, models sample or select from the predicted distribution repeatedly, appending each generated token to the context and feeding it back in — a process called autoregressive generation. This is how GPT-style models produce coherent paragraphs, answer questions, write code, and engage in dialogue. The same objective that trains the model also defines how it generates output.
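The generation loop itself is simple. In the sketch below a hypothetical bigram lookup table stands in for the full network, and greedy decoding (always picking the argmax token) stands in for sampling; real systems condition on the entire context and typically sample from the distribution:

```python
# Toy "model": probability of the next token given only the previous one.
BIGRAM_PROBS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"on": 0.8, "<eos>": 0.2},
    "on":  {"the": 0.7, "<eos>": 0.3},
    "mat": {"<eos>": 1.0},
}

def generate(prompt, max_tokens=10):
    """Autoregressive loop: append each predicted token and feed it back in."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = BIGRAM_PROBS.get(tokens[-1], {"<eos>": 1.0})
        nxt = max(dist, key=dist.get)  # greedy decoding: argmax token
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens

print(generate(["the"], max_tokens=4))  # ['the', 'cat', 'sat', 'on', 'the']
```

The structure is identical for GPT-style models; only the model replacing the lookup table changes, which is why one training objective suffices to define both learning and generation.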

The significance of next token prediction became fully apparent with the GPT series beginning in 2018, which demonstrated that scaling this simple objective on large corpora produced models with broad, transferable capabilities. It has since become the dominant pre-training paradigm for foundation models. Researchers have noted that next token prediction implicitly incentivizes world modeling — to predict text well, a model benefits from building internal representations of the underlying reality the text describes — making it far more than a narrow language task.

Related

Next Word Prediction
A training objective where models learn to predict the next token in a sequence.
Generality: 794

NTP (Next Token Prediction)
A training objective where language models learn to predict the next token in a sequence.
Generality: 795

Multi-Token Prediction
A generation strategy where language models predict multiple output tokens simultaneously.
Generality: 380

Sequence Prediction
Forecasting the next item(s) in a sequence by learning patterns from prior observations.
Generality: 794

Token Speculation Techniques
Methods that predict multiple candidate tokens in parallel to accelerate text generation.
Generality: 450

Replaced Token Detection
A self-supervised task where models learn to identify intentionally substituted tokens in sequences.
Generality: 339