NLP systems that automatically find or generate accurate answers to natural language questions.
Question answering (QA) is a subfield of natural language processing concerned with building systems that can understand a question posed in natural language and return a precise, relevant answer. Unlike general information retrieval, which returns a ranked list of documents, QA systems aim to extract or synthesize a specific answer, whether a single fact, a passage, or a generated response. The task spans a wide range of formats: open-domain QA draws on large corpora or the web, closed-domain QA is restricted to a particular domain and its knowledge sources, and reading comprehension QA requires finding the answer within a provided passage.
Modern QA systems are typically built on large pretrained language models such as BERT, RoBERTa, or GPT-style architectures. In extractive QA, the model identifies a span of text within a source document that best answers the query. In generative QA, the model produces a free-form answer conditioned on retrieved context, often using a retrieval-augmented generation (RAG) pipeline that first fetches relevant documents and then synthesizes a response. Training relies heavily on benchmark datasets such as SQuAD, Natural Questions, TriviaQA, and HotpotQA, which have driven rapid progress by providing standardized evaluation.
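As a concrete illustration of the extractive setting, the minimal sketch below uses the Hugging Face transformers question-answering pipeline. The checkpoint name (deepset/roberta-base-squad2) is just one commonly used SQuAD-style model chosen here for illustration, and the context passage is invented for the example.

```python
from transformers import pipeline

# Load an extractive QA model fine-tuned on SQuAD-style data.
# Any span-prediction checkpoint from the Hub should behave similarly.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "The Transformer architecture was introduced in 2017 and replaced "
    "recurrence with self-attention, enabling large-scale pretraining."
)

result = qa(
    question="When was the Transformer architecture introduced?",
    context=context,
)

# The pipeline returns the extracted span, a confidence score, and the
# character offsets of the span within the context.
print(result["answer"], result["score"], result["start"], result["end"])
```

In a retrieval-augmented pipeline, the same reader step would simply be preceded by a retriever that selects the context passages from a larger corpus.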
The practical importance of QA is substantial. Virtual assistants, enterprise search tools, customer support chatbots, and medical information systems all depend on QA capabilities to surface actionable answers quickly. The shift from keyword-based retrieval to neural QA has dramatically improved answer quality, enabling systems to handle paraphrased questions, multi-hop reasoning across documents, and ambiguous queries. Evaluation metrics such as Exact Match and F1 score over answer spans remain standard, though human evaluation is increasingly used for generative systems where multiple valid phrasings exist.
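For span-level evaluation, Exact Match and F1 are computed over normalized answer strings. The sketch below follows the normalization conventions popularized by the SQuAD evaluation script (lowercasing, stripping punctuation and articles, collapsing whitespace); it is an illustrative reimplementation under those assumptions, not the official scorer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between the predicted and gold answer spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1 after normalization
print(round(f1_score("Paris, France", "Paris"), 2))      # 0.67
```

In benchmark practice, each prediction is scored against all reference answers and the maximum is taken, which accommodates multiple valid phrasings of the same fact.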
QA sits at the intersection of information retrieval, knowledge representation, and language understanding, making it a useful benchmark for overall NLP progress. Advances in QA have closely tracked broader developments in deep learning—from early neural reading comprehension models around 2016 to the transformer era that followed. Today, large language models have blurred the line between QA and open-ended generation, raising new challenges around factual accuracy, hallucination, and source attribution.