Verified reference data used to train and evaluate machine learning models.
Ground truth refers to the verified, authoritative data used as a reference standard when training and evaluating machine learning models. In supervised learning, ground truth labels represent the correct answers a model is expected to learn — for example, bounding boxes drawn around objects in images, transcriptions of spoken audio, or sentiment labels attached to product reviews. The quality and accuracy of ground truth data directly determine the ceiling of a model's performance: a model trained on noisy or mislabeled ground truth will inherit those errors regardless of its architectural sophistication.
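To make the label types above concrete, here is a minimal sketch of what ground-truth records might look like for two common supervised tasks. The field names, file names, and label scheme are illustrative assumptions, not any specific dataset's schema.

```python
# Object detection: each image maps to a list of labeled bounding boxes,
# given here as (x_min, y_min, x_max, y_max) pixel coordinates.
detection_ground_truth = {
    "street_001.jpg": [
        {"label": "pedestrian", "box": (34, 50, 98, 210)},
        {"label": "car", "box": (120, 80, 310, 190)},
    ],
}

# Sentiment analysis: each review text is paired with a verified label.
sentiment_ground_truth = [
    ("Battery lasts all day, very happy.", "positive"),
    ("Stopped working after a week.", "negative"),
]

# During training a model is optimized to reproduce these labels from the
# inputs; during evaluation its predictions are scored against them.
```

Real datasets wrap the same idea in richer formats (segmentation masks, timestamps, annotator IDs), but the core structure is always an input paired with a verified answer.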
Creating ground truth typically involves human annotation, expert labeling, or collection from authoritative sources such as medical records, legal databases, or sensor measurements. In computer vision tasks, annotators might manually outline every pedestrian in thousands of street-level photographs. In natural language processing, linguists might tag parts of speech or mark named entities across large text corpora. This annotation process is often expensive and time-consuming, which has driven significant research into techniques like active learning and semi-supervised learning that reduce the volume of labeled data required.
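Active learning, one of the label-efficiency techniques mentioned above, can be sketched as uncertainty sampling: rather than labeling data at random, annotators are sent the examples the current model is least sure about. The model and scoring here are toy assumptions; `fake_predict` stands in for any trained classifier that returns class probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(unlabeled, predict_proba, budget):
    """Uncertainty sampling: return the `budget` unlabeled examples whose
    predictions have the highest entropy, so human annotation effort goes
    where a new label is most informative."""
    ranked = sorted(unlabeled, key=lambda x: entropy(predict_proba(x)), reverse=True)
    return ranked[:budget]

# Toy stand-in model: confident on short texts, uncertain on long ones.
def fake_predict(text):
    return [0.9, 0.1] if len(text) < 20 else [0.55, 0.45]

pool = ["great phone", "the delivery experience was complicated overall"]
print(select_for_labeling(pool, fake_predict, 1))
# → ['the delivery experience was complicated overall']
```

Each labeling round then retrains the model on the newly labeled examples and re-ranks the remaining pool, typically reaching a target accuracy with far fewer labels than random sampling.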
During model evaluation, ground truth serves as the benchmark against which predictions are compared to compute metrics such as accuracy, precision, recall, and F1 score. The gap between a model's outputs and the ground truth quantifies its error and guides iterative improvement. In some domains, obtaining true ground truth is genuinely difficult — medical diagnoses may be disputed among experts, or the "correct" translation of a sentence may be legitimately ambiguous — leading researchers to use inter-annotator agreement scores to measure label reliability.
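The metrics and agreement scores above can be computed directly from label comparisons. Below is a self-contained sketch: precision, recall, and F1 for binary predictions against ground truth, and Cohen's kappa as one common inter-annotator agreement score (the example labels are made up).

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 from predictions vs. ground truth
    (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for the
    agreement expected by chance. 1.0 = perfect, 0.0 = chance-level."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(binary_metrics(y_true, y_pred))  # → (0.75, 0.75, 0.75)
print(cohens_kappa([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]))
```

A low kappa between annotators signals that the "ground truth" itself is unreliable for those items, which caps how meaningfully any model can be scored against it.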
The concept has become increasingly important as AI systems are deployed in high-stakes settings where model errors carry real consequences. Debates around ground truth quality have also intersected with fairness concerns: if historical ground truth data reflects societal biases, models trained on it may perpetuate or amplify those biases. As a result, ground truth curation is now recognized not merely as a technical task but as a consequential design decision that shapes what a model learns to perceive as correct.