A metric quantifying how well a language model predicts a text sequence.
Perplexity is an evaluation metric for language models that measures how well a model predicts a given sequence of text. Formally, it is defined as the exponential of the average negative log-likelihood per token: if a model assigns high probability to the actual words in a sequence, the perplexity is low; if the model is frequently surprised by the observed words, perplexity is high. Intuitively, a perplexity of k means the model is as uncertain as if it were choosing uniformly among k equally likely options at each step. This makes it a natural, interpretable measure of a model's predictive confidence.
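The uniform-among-k intuition can be checked numerically. The sketch below uses a hypothetical model that assigns probability 1/k to every token, and shows that exponentiating the average negative log-likelihood recovers k:

```python
import math

k = 8  # hypothetical: the model chooses uniformly among k options at every step
log_probs = [math.log(1.0 / k)] * 20  # log-probability assigned to each of 20 tokens

avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood per token
perplexity = math.exp(avg_nll)
print(perplexity)  # recovers the branching factor: approximately 8
```

Any model that is more confident than uniform on the observed tokens will score below k on this measure.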
To compute perplexity, the model scores each token in a held-out test set conditioned on all preceding tokens, accumulates the log-probabilities, averages them over the sequence length, and exponentiates the result. This process is closely tied to cross-entropy loss: perplexity is simply the exponential of the cross-entropy, so minimizing cross-entropy and minimizing perplexity on the same data are equivalent objectives. The metric is therefore easy to compute as a byproduct of standard training pipelines, requiring no additional annotation or human judgment.
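A minimal sketch of that computation, assuming the per-token log-probabilities have already been obtained from some model, is just the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical log-probabilities a model assigned to a 4-token sequence.
lp = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(lp))
# Equivalently the inverse geometric mean of the token probabilities:
# (0.5 * 0.25 * 0.5 * 0.25) ** -0.25, approximately 2.83
```

In a real pipeline the log-probabilities would come from the model's softmax outputs, evaluated at the observed next token at each position.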
Perplexity matters because it provides a fast, reproducible, and annotation-free way to compare language models across architectures, training regimes, and data conditions. It has been used to benchmark n-gram models, recurrent neural networks, and large transformer-based models alike, enabling apples-to-apples comparisons as the field has evolved. Lower perplexity on a held-out corpus generally correlates with better downstream task performance, making it a useful proxy metric during development.
Despite its utility, perplexity has important limitations. It is sensitive to tokenization choices, meaning models using different vocabularies cannot be directly compared by perplexity alone. It also does not capture all dimensions of language quality — a model can achieve low perplexity while still generating repetitive, incoherent, or factually incorrect text. For this reason, perplexity is best used alongside task-specific benchmarks and human evaluations rather than as a standalone measure of model quality.
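The tokenization sensitivity is easy to see numerically: holding a text's total negative log-likelihood fixed, the per-token perplexity depends entirely on how many tokens the vocabulary splits the text into. The figures below are hypothetical, for illustration only:

```python
import math

total_nll = 40.0  # hypothetical total negative log-likelihood (in nats) for one text

# The same text segmented by two different tokenizers into different token counts:
for n_tokens in (10, 20):
    print(n_tokens, math.exp(total_nll / n_tokens))
# The coarser segmentation (10 tokens) reports a much higher per-token perplexity
# than the finer one (20 tokens), even though the total surprisal is identical.
```

This is why cross-vocabulary comparisons are usually normalized per character or per byte rather than per token.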