Retrieval-Augmented Generation enhances language model outputs by retrieving relevant documents before generating responses.
Retrieval-Augmented Generation (RAG) is an architecture that combines a neural language model with an external document retrieval system, allowing the model to consult a knowledge base at inference time rather than relying solely on information encoded in its parameters. When a query arrives, a retrieval component — typically a dense vector search over a large corpus — identifies the most semantically relevant passages. Those passages are then concatenated with the original query and fed into a generative model, which synthesizes a response grounded in the retrieved content. This two-stage pipeline was formalized in a 2020 paper from Facebook AI Research and quickly became a foundational pattern for knowledge-intensive NLP tasks.
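The two-stage pipeline can be sketched in a few lines. Everything here is a toy stand-in: the character-frequency `embed` function, the tiny in-memory corpus, and the prompt template are illustrative assumptions, not any particular system's API; a real deployment would use trained encoders and a proper generator.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: a normalized character-frequency vector. A real system
    # would use a trained neural encoder (e.g. DPR-style bi-encoders).
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

CORPUS = [
    "RAG combines retrieval with generation.",
    "FAISS enables fast nearest-neighbor search.",
    "BART is a sequence-to-sequence model.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Stage 1: score every document against the query in embedding space
    # and keep the top-k passages.
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in CORPUS]
    top = np.argsort(scores)[::-1][:k]
    return [CORPUS[i] for i in top]

def build_prompt(query: str) -> str:
    # Stage 2 input: retrieved passages concatenated with the original query,
    # ready to be fed to a generative model.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The exhaustive scoring loop is where an approximate nearest-neighbor index would slot in once the corpus grows beyond a few thousand documents.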
The retrieval step usually relies on dense passage retrieval (DPR), where both queries and documents are encoded into a shared embedding space using a bi-encoder architecture. Approximate nearest-neighbor search — via tools like FAISS — makes it practical to search millions of documents in milliseconds. The generative component is typically a sequence-to-sequence model such as BART or a decoder-only model like GPT, fine-tuned to condition its outputs on the retrieved context. Some RAG variants marginalize over multiple retrieved documents during training, while others select a single top-ranked passage, trading coverage for simplicity.
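The marginalization mentioned above can be made concrete with a small numeric sketch: the output probability is the retrieval-weighted mixture of per-document generator probabilities, p(y|x) = Σ_z p(z|x)·p(y|x,z). The probabilities below are invented for illustration.

```python
import numpy as np

# p(z|x): retrieval scores for the top-3 passages, normalized to a distribution.
retrieval_probs = np.array([0.5, 0.3, 0.2])

# p(y|x, z): probability the generator assigns to the answer y, conditioned
# on each retrieved passage z (illustrative numbers).
generator_probs = np.array([0.9, 0.4, 0.1])

# Marginalizing over documents: 0.5*0.9 + 0.3*0.4 + 0.2*0.1
marginal = float(retrieval_probs @ generator_probs)  # ≈ 0.59
```

A single-passage variant would instead just take `generator_probs[0]`, which is simpler but ignores evidence spread across the other retrieved documents.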
RAG addresses a fundamental limitation of parametric language models: their knowledge is frozen at training time and can become stale, incomplete, or confidently wrong. By externalizing factual knowledge into a retrievable corpus, RAG systems can be updated simply by refreshing the document index rather than retraining the entire model. This makes them far more practical for domains where accuracy and currency matter, such as medical question answering, enterprise search, and legal research.
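The index-refresh idea can be sketched as follows; the `DocumentIndex` class and its keyword-match `search` are hypothetical placeholders (a production index would store embeddings), but the point stands: new facts enter the system as documents, not as weight updates.

```python
class DocumentIndex:
    """Minimal stand-in for a retrievable corpus."""

    def __init__(self) -> None:
        self.docs: list[str] = []

    def add(self, doc: str) -> None:
        # Refreshing knowledge is just adding (or replacing) documents.
        self.docs.append(doc)

    def search(self, term: str) -> list[str]:
        # Placeholder keyword match; a real index would rank by embedding
        # similarity rather than substring containment.
        return [d for d in self.docs if term.lower() in d.lower()]

index = DocumentIndex()
index.add("Drug X approved in 2020.")
# New knowledge arrives: refresh the index, no model retraining required.
index.add("Drug X label updated in 2024 with new dosage guidance.")
```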
Since its introduction, RAG has evolved considerably. Modern implementations incorporate reranking stages, query rewriting, iterative retrieval, and hybrid sparse-dense search to improve recall and precision. The pattern has also been extended to multimodal settings, where retrieved content may include images or structured data. RAG now sits at the center of most production deployments of large language models that require factual grounding, making it one of the most practically impactful architectural ideas in recent NLP.
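Of the extensions listed above, hybrid sparse-dense search is easy to illustrate: a lexical score and a dense-similarity score are combined with a tunable weight. The term-overlap function (standing in for BM25), the cosine score, and the `alpha` weighting are illustrative assumptions, not a specific system's formula.

```python
import numpy as np

def sparse_score(query: str, doc: str) -> float:
    # Simple term overlap standing in for a sparse scorer like BM25.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    # Cosine similarity between precomputed embeddings.
    return float(q_vec @ d_vec / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def hybrid_score(query: str, doc: str,
                 q_vec: np.ndarray, d_vec: np.ndarray,
                 alpha: float = 0.5) -> float:
    # Weighted sum: alpha favors lexical match, (1 - alpha) semantic match.
    return alpha * sparse_score(query, doc) + (1 - alpha) * dense_score(q_vec, d_vec)
```

Sparse scoring catches exact terms (names, IDs) that dense embeddings can blur, while the dense term catches paraphrases with no lexical overlap; tuning `alpha` trades one off against the other.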