A weighting scheme that measures how important a word is to a document within a larger collection.
TF-IDF is a numerical statistic used in natural language processing and information retrieval to quantify how relevant a word is to a specific document within a larger corpus. It combines two complementary measures: Term Frequency (TF), which counts how often a word appears in a given document, and Inverse Document Frequency (IDF), which down-weights words that appear frequently across many documents. Multiplying these two values produces a score that is high for words that are distinctive to a particular document and low for words that are ubiquitous across the corpus — common stopwords like "the" or "and" receive near-zero scores, while rare but locally frequent terms receive high scores.
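A quick worked example (with illustrative numbers, using the natural log) makes the intuition concrete. Suppose a corpus of 10 documents in which "the" appears in all 10, while "photosynthesis" appears in just one document, 5 times. Then "the" gets IDF = log(10/10) = 0, so its TF-IDF score is zero no matter how often it occurs, while "photosynthesis" gets IDF = log(10/1) ≈ 2.3 and a TF-IDF score of about 5 × 2.3 ≈ 11.5 in that document. Real implementations differ in smoothing and normalization, but the shape of the result is the same.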
The mechanics are straightforward. TF is typically computed as the raw count of a term in a document, sometimes normalized by document length. IDF is calculated as the logarithm of the ratio of total documents to the number of documents containing the term, so a word appearing in every document gets an IDF of zero. The final TF-IDF score is the product of these two quantities, and documents or terms can then be compared using vector representations built from these scores — a framework known as the vector space model.
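A minimal sketch of these mechanics, assuming raw-count TF and unsmoothed log-ratio IDF (production libraries typically add smoothing and length normalization), fits in a few lines of Python:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenized documents.

    TF is the raw count of a term in a document; IDF is
    log(N / df), where N is the corpus size and df is the
    number of documents containing the term. No smoothing.
    """
    n = len(docs)
    # df: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    # One score dict per document: raw count times IDF.
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]

docs = [doc.lower().split() for doc in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum entanglement defies classical intuition",
]]
scores = tf_idf(docs)
# "the" appears in 2 of 3 documents, so its IDF is low;
# "quantum" appears in only 1, so its IDF is high.
print(scores[2]["quantum"])  # log(3/1) ≈ 1.10
print(scores[0]["the"])      # 2 * log(3/2) ≈ 0.81
```

Treating each per-document score dictionary as a sparse vector gives exactly the vector space model: documents can then be compared by cosine similarity between their vectors.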
TF-IDF became foundational to information retrieval and text mining because it offers a simple, interpretable, and computationally efficient way to represent text. Search engines historically used TF-IDF as a core signal for ranking documents against a query, and it remains a strong baseline for tasks like keyword extraction, document similarity, and text classification. Even as neural embedding methods have grown dominant, TF-IDF retains practical value in low-resource settings, interpretability-sensitive applications, and as a feature engineering tool.
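As one common way to put this to work for document similarity, scikit-learn's TfidfVectorizer builds the vectors directly; note that it applies smoothed IDF and L2 normalization by default, so its scores differ from the raw formula sketched above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum entanglement defies classical intuition",
]

# Fit the vocabulary and transform docs into L2-normalized TF-IDF vectors.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between all documents.
sims = cosine_similarity(matrix)
print(sims.round(2))
# The two cat sentences share terms, so their similarity is high;
# the physics sentence shares none and scores 0 against both.
```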
Despite its strengths, TF-IDF has notable limitations: it ignores word order and semantic meaning, treats synonyms as entirely distinct terms, and struggles with short documents where frequency statistics are unreliable. These shortcomings motivated the development of more expressive representations such as word embeddings and transformer-based models, but TF-IDF remains a widely taught and widely deployed technique that shaped how the field thinks about term weighting and document representation.
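The synonym limitation is easy to demonstrate with the same tooling, using a hypothetical two-document corpus (any TF-IDF implementation behaves the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences with the same meaning but no shared vocabulary.
docs = ["the automobile was quick", "a fast car"]

matrix = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(matrix)[0, 1])
# 0.0: with no overlapping terms, TF-IDF sees the two
# documents as completely unrelated despite their meaning.
```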