A contiguous sequence of n items drawn from text or speech.
An n-gram is a contiguous sequence of n items — typically words, characters, or phonemes — extracted from a body of text or speech. The value of n determines the scope: a unigram (n=1) is a single token, a bigram (n=2) is a pair of adjacent tokens, and a trigram (n=3) covers three consecutive tokens. By sliding a window of size n across a sequence, one can enumerate all such subsequences and count their frequencies, building a statistical picture of how language is structured.
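As a concrete illustration of that sliding window, here is a minimal Python sketch; the `ngrams` helper and the example sentence are illustrative, not drawn from any particular library.

```python
# Minimal sketch: enumerate all contiguous n-grams of a token sequence
# by sliding a window of size n, then count their frequencies.
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous n-item subsequence of `tokens`."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = "the cat sat on the mat".split()
print(Counter(ngrams(tokens, 2)))
# Counter({('the', 'cat'): 1, ('cat', 'sat'): 1, ('sat', 'on'): 1,
#          ('on', 'the'): 1, ('the', 'mat'): 1})
```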
In language modeling, n-gram models estimate the probability of a word given the n−1 words that precede it. This Markov assumption — that only recent context matters — makes the models computationally tractable. Probabilities are estimated from corpus counts, often with smoothing techniques (such as Laplace or Kneser-Ney smoothing) to handle unseen sequences. Despite their simplicity, n-gram language models powered state-of-the-art speech recognition and machine translation systems for decades, and they remain useful as fast baselines and feature extractors.
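A minimal sketch of such an estimator, assuming a bigram model with Laplace (add-one) smoothing over a toy corpus; the function and variable names here are illustrative.

```python
# Laplace-smoothed bigram model:
#   P(w | prev) = (count(prev, w) + 1) / (count(prev) + V),
# where V is the vocabulary size; the add-one term gives
# unseen bigrams a small nonzero probability.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size

def bigram_prob(prev, word):
    """Smoothed estimate of P(word | prev) from corpus counts."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(bigram_prob("the", "cat"))  # seen bigram   -> 3/9 ~= 0.33
print(bigram_prob("the", "ran"))  # unseen bigram -> 1/9 ~= 0.11
```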
N-grams became central to NLP pipelines in the 1980s and 1990s as large text corpora and sufficient computing power became available. They underpin classic tasks such as spell correction, predictive text input, authorship attribution, and information retrieval. Character-level n-grams are especially useful for language identification and handling morphologically rich languages, since they capture sub-word patterns without requiring explicit tokenization.
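A toy sketch of the character-level idea: build a character trigram profile per language and score a sample by how much trigram mass it shares with each. The snippets, profiles, and overlap measure below are illustrative assumptions, not a production method.

```python
# Toy character-trigram language identification: build trigram frequency
# profiles per language, then score a sample by shared trigram mass.
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams; no word tokenization required."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "de": char_ngrams("der schnelle braune fuchs springt"),
}

def shared_mass(p, q):
    """Crude similarity: total count of n-grams the profiles share."""
    return sum(min(c, q[g]) for g, c in p.items())

sample = char_ngrams("the brown fox")
print(max(profiles, key=lambda lang: shared_mass(sample, profiles[lang])))
# -> 'en'
```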
With the rise of neural language models (RNNs, LSTMs, and eventually Transformers), n-gram models have been largely superseded for tasks requiring long-range context, because neural approaches learn dense representations that generalize far better to unseen sequences. Nevertheless, n-gram features continue to appear in text classification, in evaluation metrics such as BLEU, which scores machine translation output by its n-gram overlap with reference translations, and as interpretable components in hybrid systems. Their transparency and computational efficiency keep them a practical tool in the modern NLP toolkit.
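To make the BLEU connection concrete, here is a sketch of the modified n-gram precision at its core, simplified to a single reference and omitting the brevity penalty and the geometric mean over n; the helper names are illustrative.

```python
# Modified n-gram precision (the core of BLEU, simplified): candidate
# n-gram counts are clipped by their counts in the reference, so a
# candidate cannot earn credit by repeating a matching n-gram.
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(zip(*(tokens[i:] for i in range(n))))

def modified_precision(candidate, reference, n):
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
print(modified_precision(cand, ref, 1))  # 5/6 unigram precision
print(modified_precision(cand, ref, 2))  # 2/5 bigram precision
```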