A statistical model that predicts each word based solely on the preceding word.
A bigram language model is a statistical approach to natural language processing that estimates the probability of a word given only the single word that immediately precedes it. Formally, it applies the Markov assumption to language: the probability of a sequence of words is approximated as the product of conditional probabilities, where each word depends only on its immediate predecessor. This reduces the intractable problem of modeling full sentence context to a manageable set of pairwise word co-occurrence statistics that can be estimated directly from a text corpus.
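To make the Markov approximation concrete, the sketch below scores a short sentence as a product of pairwise conditional probabilities. It is a minimal illustration only: the probability table is a hypothetical hand-filled example, not estimates learned from a real corpus, and the sentence-boundary markers `<s>` and `</s>` are assumed conventions.

```python
# Minimal sketch of the bigram (Markov) approximation, assuming a toy
# hand-filled probability table rather than counts from a real corpus.
bigram_prob = {
    ("<s>", "machine"): 0.05,      # P(machine | start of sentence)
    ("machine", "learning"): 0.8,  # P(learning | machine)
    ("learning", "works"): 0.1,
    ("works", "</s>"): 0.3,        # P(end of sentence | works)
}

def sentence_probability(words, probs):
    """Approximate P(w1..wn) as the product of P(w_i | w_{i-1}) terms."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        p *= probs.get((prev, curr), 0.0)  # unseen pairs get probability 0 here
    return p

print(sentence_probability(["machine", "learning", "works"], bigram_prob))
# 0.05 * 0.8 * 0.1 * 0.3 = 0.0012
```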
Training a bigram model involves counting how often each word pair appears in a corpus and normalizing those counts into conditional probabilities. For example, if the word "machine" is followed by "learning" 800 times out of 1,000 occurrences of "machine," the model assigns P(learning | machine) = 0.8. In practice, smoothing techniques such as Laplace smoothing or Kneser-Ney smoothing are applied to handle word pairs that never appeared in the training data, preventing zero-probability assignments that would break downstream calculations.
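The sketch below illustrates that training procedure on a toy corpus, counting word pairs and applying add-one (Laplace) smoothing. The corpus, boundary markers, and function names are illustrative assumptions rather than a standard library API.

```python
from collections import Counter

def train_bigram_laplace(corpus, alpha=1.0):
    """Count bigrams in a tokenized corpus and return an add-alpha smoothed P(word | prev)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])             # context counts: everything that can precede a word
        bigrams.update(zip(tokens, tokens[1:]))  # word-pair counts
    vocab_size = len(vocab)

    def prob(prev, word):
        # Laplace smoothing: add alpha to every count so unseen pairs keep a nonzero probability.
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

    return prob

corpus = [["machine", "learning", "works"], ["machine", "translation", "works"]]
prob = train_bigram_laplace(corpus)
print(prob("machine", "learning"))  # seen pair: relatively high probability
print(prob("machine", "works"))     # pair never seen in training: small but nonzero
```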
Bigram models became a foundational tool in NLP tasks including automatic speech recognition, spelling correction, predictive text, and early machine translation systems. Their appeal lies in computational simplicity: the model requires only a matrix of transition probabilities with one row and one column per vocabulary word, and it can score or generate text extremely efficiently. However, the strict two-word window means the model is blind to longer-range dependencies: it cannot capture that "the bank" means something different in "the bank of a river" versus "the bank approved the loan."
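As a rough illustration of how cheap generation is, the sketch below samples words one at a time from a bigram transition table. The table is a hypothetical toy example chosen to echo the "bank" sentences above, not a trained model.

```python
import random

# Toy transition table: for each context word, a list of (next word, probability) pairs.
transitions = {
    "<s>":      [("the", 0.6), ("a", 0.4)],
    "the":      [("bank", 0.5), ("loan", 0.5)],
    "a":        [("river", 1.0)],
    "bank":     [("approved", 0.7), ("</s>", 0.3)],
    "approved": [("the", 0.4), ("</s>", 0.6)],
    "loan":     [("</s>", 1.0)],
    "river":    [("</s>", 1.0)],
}

def generate(transitions, max_len=20):
    """Sample words one at a time, each conditioned only on the previous word."""
    word, output = "<s>", []
    while len(output) < max_len:
        choices, weights = zip(*transitions[word])
        word = random.choices(choices, weights=weights)[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

print(generate(transitions))  # e.g. "the bank approved the loan"
```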
Despite being largely superseded by neural language models such as RNNs, LSTMs, and transformers, bigram models remain relevant as baselines, as components in larger systems, and as pedagogical tools for understanding probabilistic language modeling. They also appear as a special case within n-gram model families, and the intuitions behind them — counting co-occurrences, applying smoothing, evaluating with perplexity — carry forward directly into modern NLP practice.
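As one example of how those intuitions carry forward, the sketch below computes perplexity, the standard intrinsic evaluation metric, for a bigram model on a tiny test set. The probability table is a hypothetical stand-in for a trained, smoothed model such as the one sketched earlier.

```python
import math

def perplexity(sentences, prob):
    """Perplexity = exp(-average log P(w_i | w_{i-1})) over all bigrams in the test set."""
    total_log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            total_log_prob += math.log(prob(prev, curr))
            n_tokens += 1
    return math.exp(-total_log_prob / n_tokens)

# Hypothetical toy probabilities standing in for a trained, smoothed bigram model.
toy_probs = {("<s>", "machine"): 0.05, ("machine", "learning"): 0.8,
             ("learning", "works"): 0.1, ("works", "</s>"): 0.3}
print(perplexity([["machine", "learning", "works"]],
                 lambda prev, curr: toy_probs.get((prev, curr), 1e-6)))  # ≈ 5.4
```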