
Perplexity

A metric quantifying how well a language model predicts a text sequence.

Year: 1988 · Generality: 713

Perplexity is an evaluation metric for language models that measures how well a model predicts a given sequence of text. Formally, it is defined as the exponential of the average negative log-likelihood per token: if a model assigns high probability to the actual words in a sequence, the perplexity is low; if the model is frequently surprised by the observed words, perplexity is high. Intuitively, a perplexity of k means the model is as uncertain as if it were choosing uniformly among k equally likely options at each step. This makes it a natural, interpretable measure of a model's predictive confidence.
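As a minimal formal sketch (the notation below is assumed for illustration, not given on this page): for a tokenized sequence x_1, ..., x_N scored by a model p_θ, perplexity is

```latex
\mathrm{PPL}(x_{1:N}) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
```

Under this definition, a model that assigns probability 1/k to every observed token has perplexity exactly k, matching the uniform-choice intuition above.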

To compute perplexity, the model scores each token in a held-out test set conditioned on all preceding tokens, accumulates the log-probabilities, averages their negation over the sequence length, and exponentiates the result. This quantity is closely tied to cross-entropy loss: perplexity is simply the exponential of the per-token cross-entropy, so minimizing cross-entropy and minimizing perplexity are equivalent objectives. The metric is therefore easy to compute as a byproduct of standard training pipelines, requiring no additional annotation or human judgment.
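A minimal sketch of this computation, assuming per-token log-probabilities have already been obtained from some model (the function name and toy values are illustrative, not from this page):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities log p(x_i | x_<i)."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    # Average negative log-likelihood per token (cross-entropy in nats)...
    cross_entropy = -sum(token_logprobs) / len(token_logprobs)
    # ...exponentiated to give perplexity.
    return math.exp(cross_entropy)

# Toy check: assigning probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options, so perplexity is 4.
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```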

Perplexity matters because it provides a fast, reproducible, and annotation-free way to compare language models across architectures, training regimes, and data conditions. It has been used to benchmark n-gram models, recurrent neural networks, and large transformer-based models alike, enabling apples-to-apples comparisons as the field has evolved. Lower perplexity on a held-out corpus generally correlates with better downstream task performance, making it a reliable proxy metric during development.

Despite its utility, perplexity has important limitations. It is sensitive to tokenization choices, meaning models using different vocabularies cannot be directly compared by perplexity alone. It also does not capture all dimensions of language quality — a model can achieve low perplexity while still generating repetitive, incoherent, or factually incorrect text. For this reason, perplexity is best used alongside task-specific benchmarks and human evaluations rather than as a standalone measure of model quality.

Related

Surprisal

A measure of how unexpected an event is, based on its probability.

Generality: 620
Unigram Entropy

A measure of word-level unpredictability in text, assuming each word occurs independently.

Generality: 450
Semantic Entropy

A measure of uncertainty in the meaning of language model outputs.

Generality: 380
Surprise

A measure of how unexpected or novel an outcome is given a model's predictions.

Generality: 620
Permutation

An ordered arrangement of elements from a set, foundational to combinatorics and ML algorithms.

Generality: 795
Next Token Prediction

A training objective where models learn to predict the next token in a sequence.

Generality: 794