
Unigram Entropy

A measure of word-level unpredictability in text, assuming each word occurs independently.

Year: 1990
Generality: 450

Unigram entropy is an information-theoretic measure that quantifies the average uncertainty involved in predicting any single word drawn from a text corpus, under the assumption that each word occurs independently of all others. Formally, it is computed as the Shannon entropy of the unigram probability distribution: for each word in the vocabulary, its probability of occurrence is estimated from corpus frequencies, and the entropy is the negative sum of each probability multiplied by its own logarithm. The result is expressed in bits (or nats, depending on the logarithm base) and represents the theoretical minimum average number of bits needed to encode a word sampled from that distribution.
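
As a concrete sketch, the estimate can be computed in a few lines of Python. The toy corpus and function name here are illustrative assumptions, not part of any particular library:

from collections import Counter
from math import log2

def unigram_entropy(tokens):
    # Estimate p(w) from raw frequencies, then H = -sum over w of p(w) * log2 p(w)
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

corpus = "the cat sat on the mat the cat ran".split()
print(round(unigram_entropy(corpus), 3))  # 2.419 bits per word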

In practice, unigram entropy serves as a baseline measure of corpus complexity. A high unigram entropy signals a broad, evenly distributed vocabulary where many words appear with similar frequency, making individual words hard to predict. Low entropy, by contrast, reflects a skewed distribution dominated by a small set of high-frequency words — a pattern consistent with Zipf's law, which describes natural language quite well. This baseline is especially useful when comparing corpora across domains, languages, or time periods, and when evaluating how much additional predictive power higher-order language models (bigrams, trigrams, neural models) provide over the unigram assumption.
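
One way to make the "additional predictive power" point concrete is to compare unigram entropy with the conditional entropy of a bigram model on the same text. The sketch below (corpus and function names are illustrative) shows the gap on a highly repetitive toy corpus, where knowing the previous word removes most of the uncertainty:

from collections import Counter
from math import log2

def unigram_entropy(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def bigram_conditional_entropy(tokens):
    # H(W_i | W_{i-1}) = -sum over bigrams of p(w1, w2) * log2 p(w2 | w1)
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    context_counts = Counter(w1 for w1, _ in pairs)
    n = len(pairs)
    return -sum((c / n) * log2(c / context_counts[w1])
                for (w1, _), c in pair_counts.items())

corpus = ("the cat sat on the mat " * 50).split()
print(round(unigram_entropy(corpus), 3))             # 2.252 bits per word
print(round(bigram_conditional_entropy(corpus), 3))  # 0.334 bits per word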

Unigram entropy plays a practical role in several machine learning and NLP pipelines. In data compression, it sets a lower bound on the average code length per word achievable by any scheme that encodes words independently of context. In language model evaluation, it anchors perplexity calculations — perplexity is the exponentiated cross-entropy (two raised to the per-word cross-entropy in bits) and measures how well a model predicts held-out text. Tokenization schemes such as the Unigram Language Model used in SentencePiece optimize a unigram likelihood objective (equivalently, minimizing unigram cross-entropy) to select an informative subword vocabulary for downstream tasks like machine translation and speech recognition.
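
The entropy-to-perplexity relationship is a one-liner. The sketch below converts a model's per-token probabilities on held-out text into perplexity; the probability values are made up for illustration:

from math import log2

def perplexity(token_probs):
    # Cross-entropy: average negative log2-probability per held-out token
    cross_entropy = -sum(log2(p) for p in token_probs) / len(token_probs)
    # Perplexity is two raised to the cross-entropy in bits
    return 2 ** cross_entropy

probs = [0.25, 0.1, 0.5, 0.05]  # hypothetical per-token model probabilities
print(round(perplexity(probs), 2))  # 6.32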

Despite its simplicity — ignoring all contextual dependencies between words — unigram entropy remains a foundational diagnostic tool. It provides a reproducible, interpretable reference point against which more sophisticated models are benchmarked, and it continues to inform decisions about vocabulary size, data preprocessing, and model architecture in modern NLP systems.

Related

Semantic Entropy

A measure of uncertainty in the meaning of language model outputs.

Generality: 380
N-gram

A contiguous sequence of N items drawn from text or speech.

Generality: 700
Surprisal

A measure of how unexpected an event is, based on its probability.

Generality: 620
Perplexity

A metric quantifying how well a language model predicts a text sequence.

Generality: 713
Trigrams

A sequence of three consecutive tokens used in language modeling and NLP.

Generality: 420
Bigram Language Model

A statistical model that predicts each word based solely on the preceding word.

Generality: 574