Envisioning is an emerging technology research institute and advisory.

2011 — 2026


N-gram

A contiguous sequence of N items drawn from text or speech.

Year: 1980
Generality: 700

An n-gram is a contiguous sequence of n items — typically words, characters, or phonemes — extracted from a body of text or speech. The value of n determines the scope: a unigram (n=1) is a single token, a bigram (n=2) is a pair of adjacent tokens, and a trigram (n=3) covers three consecutive tokens. By sliding a window of size n across a sequence, one can enumerate all such subsequences and count their frequencies, building a statistical picture of how language is structured.
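The sliding-window extraction described above can be sketched in a few lines of Python (the function name and sample sentence are illustrative, not from any particular library):

```python
from collections import Counter

def ngrams(tokens, n):
    """Enumerate all contiguous subsequences of length n by sliding a window."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)          # [('the', 'cat'), ('cat', 'sat'), ...]
counts = Counter(ngrams(tokens, 1))  # unigram frequencies
```

Counting the resulting tuples with a `Counter` gives exactly the frequency table the paragraph describes.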

In language modeling, n-gram models estimate the probability of a word given the n−1 words that precede it. This Markov assumption — that only recent context matters — makes the models computationally tractable. Probabilities are estimated from corpus counts, often with smoothing techniques (such as Laplace or Kneser-Ney smoothing) to handle unseen sequences. Despite their simplicity, n-gram language models powered state-of-the-art speech recognition and machine translation systems for decades, and they remain useful as fast baselines and feature extractors.
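A minimal bigram model with add-one (Laplace) smoothing might look like the following sketch; the toy corpus and function names are illustrative:

```python
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and adjacent pairs from a token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def laplace_prob(word, prev, unigrams, bigrams, vocab_size):
    """P(word | prev) with add-one smoothing: unseen pairs get nonzero mass."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

tokens = "a b a b a".split()
uni, bi = train_bigram(tokens)
# count(a, b) = 2, count(a) = 3, V = 2  ->  (2 + 1) / (3 + 2) = 0.6
p = laplace_prob("b", "a", uni, bi, vocab_size=len(uni))
```

Note how the smoothing term keeps the probability of a never-seen pair like ("b", "b") strictly positive, which is what makes the model usable on text outside its training corpus.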

N-grams became central to NLP pipelines in the 1980s and 1990s as large text corpora and sufficient computing power became available. They underpin classic tasks such as spell correction, predictive text input, authorship attribution, and information retrieval. Character-level n-grams are especially useful for language identification and handling morphologically rich languages, since they capture sub-word patterns without requiring explicit tokenization.
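Because character-level n-grams need no tokenizer, a frequency profile of trigrams (a common fingerprint for language identification) can be built directly from the raw string. A sketch, with illustrative names:

```python
from collections import Counter

def char_ngram_profile(text, n=3, top=3):
    """Most frequent character n-grams; spaces mark word boundaries."""
    s = f" {text.lower().strip()} "
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    return Counter(grams).most_common(top)

profile = char_ngram_profile("the quick brown fox jumps over the lazy dog")
```

Comparing such profiles across documents (for example by rank distance) is the basis of classic character-n-gram language identifiers.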

With the rise of neural language models (RNNs, LSTMs, and eventually Transformers), n-gram models have been largely superseded for tasks requiring long-range context, because neural approaches learn dense representations that generalize far better to unseen sequences. Nevertheless, n-gram features continue to appear in text classification, in evaluation metrics such as BLEU, which scores machine translation output by its n-gram overlap with reference translations, and as interpretable components in hybrid systems. Their transparency and computational efficiency keep them a practical tool in the modern NLP toolkit.
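The n-gram overlap at the heart of BLEU can be illustrated with a clipped-precision sketch (simplified for illustration: full BLEU combines several n-gram orders and adds a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: overlap counts are capped by the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

# Clipping stops a degenerate candidate from gaming the score:
# "the the the" vs. "the cat" scores 1/3, not 3/3.
score = ngram_precision("the the the".split(), "the cat".split(), 1)
```

The clipping step is what makes the metric robust: each candidate n-gram can only be credited as many times as it appears in the reference.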

Related

Trigrams

A sequence of three consecutive tokens used in language modeling and NLP.

Generality: 420

Bigram Language Model

A statistical model that predicts each word based solely on the preceding word.

Generality: 574

Unigram Entropy

A measure of word-level unpredictability in text, assuming each word occurs independently.

Generality: 450

Next Word Prediction

A training objective where models learn to predict the next token in a sequence.

Generality: 794

NTP (Next Token Prediction)

A training objective where language models learn to predict the next token in a sequence.

Generality: 795

Token

The basic unit of text that language models read, process, and generate.

Generality: 720