
NTP (Next Token Prediction)

A training objective where language models learn to predict the next token in a sequence.

Year: 2018
Generality: 795

Next Token Prediction (NTP) is the foundational training objective underlying most modern autoregressive language models. Given a sequence of tokens — discrete units representing words, subwords, or characters — the model is trained to estimate the probability distribution over all possible next tokens. By repeatedly applying this objective across enormous text corpora, the model learns grammar, factual associations, reasoning patterns, and stylistic conventions purely from the statistical structure of language. At inference time, the same mechanism drives text generation: the model samples or selects from its predicted distribution, appends the chosen token, and repeats the process to produce fluent, coherent output.
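To make the inference loop concrete, here is a minimal sketch of the sample-append-repeat cycle. The toy vocabulary and the stand-in `next_token_logits` function are illustrative assumptions, not any particular model's API; a trained model would condition its scores on the full prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "the", "cat", "sat", "on", "mat", "."]  # toy vocabulary (assumption)

def next_token_logits(tokens):
    """Stand-in for a trained language model: returns one score per vocabulary entry.
    A real model would compute these scores from the full token prefix."""
    return rng.normal(size=len(vocab))

def generate(prompt, max_new_tokens=5, temperature=1.0):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                       # softmax over the vocabulary
        next_id = rng.choice(len(vocab), p=probs)  # sample from the predicted distribution
        tokens.append(vocab[next_id])              # append the chosen token and repeat
    return tokens

print(generate(["<bos>", "the"]))
```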

The mechanics of NTP are tightly coupled to the transformer architecture. A causal (decoder-only) transformer processes the input sequence with masked self-attention, ensuring each token can only attend to preceding tokens. The final hidden state at each position is projected through a linear layer and a softmax activation to produce a probability distribution over the vocabulary. Training minimizes cross-entropy loss between predicted distributions and the actual next tokens — a simple objective that scales remarkably well with model size and data volume, a property formalized in neural scaling laws.
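The training mechanics can be sketched in a few lines of PyTorch: a small causal transformer (all dimensions here are arbitrary assumptions, not a production architecture) produces logits at every position, and cross-entropy is computed against the same sequence shifted by one token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Minimal decoder-only model for illustration; dimensions are assumptions."""
    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # projects hidden states to vocabulary logits

    def forward(self, ids):
        T = ids.size(1)
        x = self.tok_emb(ids) + self.pos_emb(torch.arange(T, device=ids.device))
        # Causal mask: each position may only attend to itself and preceding positions.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h)  # (batch, T, vocab_size) logits

model = TinyCausalLM()
ids = torch.randint(0, 100, (2, 16))   # a batch of token-id sequences
logits = model(ids[:, :-1])            # predictions from every prefix
targets = ids[:, 1:]                   # the actual "next token" at each position
loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
loss.backward()
print(loss.item())
```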

What makes NTP so powerful is that it is a self-supervised objective: no human-labeled data is required. Any raw text corpus serves as a training signal, enabling models to train on trillions of tokens scraped from the web, books, and code. This abundance of supervision, combined with the expressive capacity of large transformers, produces models that generalize far beyond simple word completion — exhibiting emergent capabilities in translation, summarization, mathematical reasoning, and code generation.
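A short sketch of why no labels are needed: given any stream of token ids, the targets are simply the inputs shifted one position to the left. The whitespace "tokenizer" and toy corpus below are illustrative assumptions; production systems use learned subword tokenizers over web-scale corpora.

```python
def make_ntp_examples(token_ids, context_len):
    """Turn a raw token stream into (input, target) pairs with no human labels:
    the target sequence is the input sequence shifted one position to the left."""
    examples = []
    for start in range(0, len(token_ids) - context_len, context_len):
        chunk = token_ids[start : start + context_len + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

# Toy whitespace "tokenizer" and corpus, purely for illustration.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(corpus))}
ids = [vocab[tok] for tok in corpus]

for inp, tgt in make_ntp_examples(ids, context_len=4):
    print(inp, "->", tgt)
```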

NTP has become the dominant pre-training paradigm for large language models (LLMs) such as GPT-3, GPT-4, LLaMA, and Claude. While alternatives like masked language modeling (used in BERT) predict tokens at arbitrary positions, NTP's left-to-right generation makes it naturally suited for open-ended text synthesis. Fine-tuning techniques such as RLHF (Reinforcement Learning from Human Feedback) are typically layered on top of NTP-pretrained models to align behavior with human preferences, but the NTP foundation remains central to modern language AI.

Related

  • Next Token Prediction: A training objective where models learn to predict the next token in a sequence. (Generality: 794)
  • Next Word Prediction: A training objective where models learn to predict the next token in a sequence. (Generality: 794)
  • Multi-Token Prediction: A generation strategy where language models predict multiple output tokens simultaneously. (Generality: 380)
  • GPT (Generative Pre-Trained Transformer): A transformer-based language model pre-trained to generate coherent, human-like text. (Generality: 865)
  • Replaced Token Detection: A self-supervised task where models learn to identify intentionally substituted tokens in sequences. (Generality: 339)
  • Token Speculation Techniques: Methods that predict multiple candidate tokens in parallel to accelerate text generation. (Generality: 450)