Reordering an initial set of retrieved results using a more sophisticated secondary model.
Reranking is a two-stage approach used in information retrieval, search engines, and recommender systems where an initial candidate set—retrieved quickly using lightweight matching criteria like keyword overlap or embedding similarity—is subsequently reordered by a more powerful and computationally expensive model. The first stage prioritizes recall, casting a wide net to ensure relevant items are not missed. The second stage prioritizes precision, applying richer signals to surface the most relevant results at the top of the list, where user attention is concentrated and the impact on experience is greatest.
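The two-stage pattern can be sketched in a few lines of Python. The scoring functions below are toy stand-ins invented for illustration: a cheap keyword-overlap score plays the recall-oriented first stage, and a slightly richer score (rewarding exact adjacent-term matches) plays the expensive precision-oriented second stage, applied only to the shortlist.

```python
def first_stage_score(query, doc):
    # Cheap, recall-oriented: fraction of query terms present in the doc.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def second_stage_score(query, doc):
    # Toy stand-in for an expensive model: term overlap plus a bonus
    # for matching adjacent term pairs (a crude phrase-match signal).
    q = query.lower().split()
    d = doc.lower().split()
    overlap = len(set(q) & set(d)) / max(len(set(q)), 1)
    phrase = len(set(zip(q, q[1:])) & set(zip(d, d[1:])))
    return overlap + phrase

def retrieve_then_rerank(query, corpus, k_retrieve=100, k_final=10):
    # Stage 1: cast a wide net with the cheap score.
    candidates = sorted(corpus,
                        key=lambda doc: first_stage_score(query, doc),
                        reverse=True)[:k_retrieve]
    # Stage 2: apply the expensive score to the shortlist only.
    return sorted(candidates,
                  key=lambda doc: second_stage_score(query, doc),
                  reverse=True)[:k_final]
```

The key design point is that `second_stage_score` is never evaluated over the full corpus, only over the `k_retrieve` survivors of the first stage.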
The reranking model typically has access to features that are unavailable at retrieval time or too costly to compute there: fine-grained semantic similarity, user interaction history, contextual signals, cross-attention between query and document, and learned relevance scores from human-labeled data. In modern natural language processing pipelines, large pretrained models such as BERT-based cross-encoders are commonly used as rerankers, reading the query and each candidate document jointly to produce a nuanced relevance score. This contrasts with the bi-encoder retrieval stage, which encodes query and documents independently for speed.
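The bi-encoder versus cross-encoder distinction comes down to their interfaces. The sketch below uses hypothetical hashing-based "encoders" purely for illustration (a real system would use pretrained transformers): the bi-encoder maps each text to a vector independently, so document vectors can be precomputed and indexed, while the cross-encoder consumes the query and document as a pair, enabling interaction features that no pair of independent vectors can express.

```python
import hashlib
import math

def embed(text, dim=16):
    # Bi-encoder style: each text is encoded *independently* into a
    # normalized bag-of-words hash vector (toy stand-in for a model).
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def bi_encoder_score(query, doc):
    # Relevance is a cheap dot product of two independent encodings,
    # so document embeddings can be computed once and reused.
    return sum(q * d for q, d in zip(embed(query), embed(doc)))

def cross_encoder_score(query, doc):
    # Cross-encoder style: the pair is scored *jointly*. The toy
    # exact-phrase bonus here is a query-document interaction feature
    # that independent encodings cannot capture.
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return bi_encoder_score(query, doc) + bonus
```

Because `cross_encoder_score` must run once per query-document pair, it is reserved for the shortlist, while `embed` supports offline indexing of the whole corpus.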
Reranking has become especially prominent in retrieval-augmented generation (RAG) systems, where the quality of retrieved context directly affects downstream generation quality. By inserting a reranker between the retriever and the language model, practitioners can significantly improve answer accuracy without retraining the generator. The technique is also central to learning-to-rank frameworks, where models are trained on graded relevance judgments using objectives like pairwise or listwise loss functions designed specifically to optimize ranking metrics such as NDCG or MAP.
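NDCG, one of the ranking metrics mentioned above, is straightforward to compute from graded relevance judgments. The sketch below uses the common exponential-gain formulation of DCG, normalized by the DCG of the ideal (descending-relevance) ordering:

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: exponential gain in graded relevance,
    # discounted by the log of the (1-indexed) rank position.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the ideal DCG so a perfect ranking scores 1.0.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Pairwise and listwise learning-to-rank objectives are smooth surrogates for metrics like this one, which is itself non-differentiable because it depends on the sorted order of the scores.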
The practical value of reranking lies in its modularity and efficiency trade-off: it decouples the speed requirements of large-scale retrieval from the accuracy requirements of final result presentation. Systems can scale retrieval over billions of documents while applying expensive models only to a manageable shortlist of tens or hundreds of candidates. This architectural pattern has become a standard design principle across web search, enterprise search, question answering, and recommendation pipelines.