A probabilistic model that discovers hidden topics across a collection of documents.
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used primarily in natural language processing to uncover latent thematic structure within large collections of text. Introduced by David Blei, Andrew Ng, and Michael I. Jordan in 2003, LDA operates on the assumption that each document in a corpus is composed of a mixture of topics, and each topic is characterized by a probability distribution over words. By inferring these hidden topic structures from observed word patterns, LDA allows researchers and practitioners to organize, summarize, and explore large text corpora without requiring any labeled training data.
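The generative assumption described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real corpus: the vocabulary, topic count, and Dirichlet hyperparameters below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (hypothetical, not from any real corpus).
vocab = ["gene", "cell", "protein", "market", "stock", "trade"]
n_topics = 2
alpha = np.full(n_topics, 0.5)    # Dirichlet prior on per-document topic mixtures
beta = np.full(len(vocab), 0.1)   # Dirichlet prior on per-topic word distributions

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(beta, size=n_topics)

def generate_document(n_words: int) -> list[str]:
    """Sample one document under LDA's generative story."""
    theta = rng.dirichlet(alpha)  # this document's mixture of topics
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)            # choose a topic for this word slot
        w = rng.choice(len(vocab), p=topic_word[z])  # choose a word from that topic
        words.append(vocab[w])
    return words

doc = generate_document(8)
```

Inference in LDA runs this story in reverse: given only the sampled words, recover plausible values for `theta`, `topic_word`, and the hidden assignments `z`.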
The mechanics of LDA rely on the Dirichlet distribution as a prior over both the topic mixtures within documents and the word mixtures within topics. During inference, the model works backward from the observed words in a corpus to estimate the most likely topic assignments that could have generated those words. Common inference approaches include variational Bayes and Gibbs sampling. The result is a set of topics — each represented as a ranked list of words — along with per-document topic proportions that describe how much each topic contributes to a given document.
LDA has proven broadly useful across a range of applications beyond basic topic discovery. It has been applied to document classification, information retrieval, recommendation systems, and even non-text domains such as image analysis and bioinformatics. Its unsupervised nature makes it especially valuable when labeled data is scarce or expensive to obtain, and its interpretable output — human-readable word clusters — gives it an advantage over many black-box alternatives.
Despite its influence, LDA has notable limitations. It assumes a bag-of-words representation, ignoring word order and syntax, and requires the number of topics to be specified in advance. It can also struggle with short texts and may produce topics that are difficult to interpret. These shortcomings have motivated the development of neural topic models and transformer-based approaches, but LDA remains a foundational baseline and a widely taught method in the NLP toolkit.