A plot that evaluates classifier performance by showing the trade-off between precision and recall across decision thresholds.
A precision-recall curve is a diagnostic tool for evaluating binary classification models: it plots precision on the y-axis against recall on the x-axis across all possible decision thresholds. Precision, TP / (TP + FP), measures the fraction of positive predictions that are actually correct, while recall (also called sensitivity), TP / (TP + FN), measures the fraction of actual positives that the model successfully identifies. As the decision threshold is lowered, a model typically captures more true positives (higher recall) at the cost of also accepting more false positives (lower precision), and the curve traces this trade-off across the full threshold range.
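As a minimal sketch of how the curve is traced in practice, scikit-learn's precision_recall_curve sweeps every distinct predicted score as a decision threshold; the synthetic dataset and logistic-regression model below are illustrative assumptions rather than part of any particular workflow.

```python
# Hedged sketch: a toy imbalanced dataset and a simple model stand in for real data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# ~10% positives to mimic a skewed class distribution
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# One (precision, recall) pair per candidate threshold, plus a final (recall=0, precision=1) endpoint
precision, recall, thresholds = precision_recall_curve(y_test, scores)
```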
The curve is especially valuable when class distributions are heavily skewed. In such settings, the ROC curve can paint an overly optimistic picture because its false positive rate has the abundant true negatives in its denominator, so even a large number of false positives barely moves it. The precision-recall curve sidesteps this by focusing exclusively on the positive class, making it the preferred evaluation tool in domains like fraud detection, rare disease diagnosis, and information retrieval, where the minority class is the primary concern. A model whose curve hugs the top-right corner of the plot, maintaining high precision even at high recall, is considered strong.
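One way to see the contrast, continuing the sketch above, is to compare ROC AUC with average precision on the same skewed test set; the exact numbers depend on the data, but the baselines differ in a telling way.

```python
# Continuing the sketch above (same y_test and scores).
from sklearn.metrics import average_precision_score, roc_auc_score

print("ROC AUC:          ", roc_auc_score(y_test, scores))
print("Average precision:", average_precision_score(y_test, scores))
# Baselines differ: a random ranking gives ROC AUC ~ 0.5 regardless of skew,
# but average precision ~ the positive-class prevalence (~0.1 here), so the
# PR view is not flattered by the abundant true negatives.
```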
A single scalar summary of the curve is often computed as the area under the precision-recall curve (AUPRC), commonly estimated as average precision. This metric aggregates performance across all thresholds into one number, enabling straightforward comparison between models. Unlike accuracy, it is not inflated by an abundance of true negatives and rewards models that rank true positives highly. Practitioners also read operating points directly off the curve to choose a threshold that satisfies application-specific constraints; for instance, a medical screening tool might prioritize recall to minimize missed diagnoses, accepting lower precision as a consequence.
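A hedged sketch of threshold selection, continuing the example above: among thresholds that meet a recall floor (the 0.90 floor is an illustrative assumption), pick the one with the highest precision.

```python
# Continuing the sketch: choose an operating threshold subject to a recall constraint.
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, scores)

recall_floor = 0.90  # illustrative application constraint
# precision/recall have one more entry than thresholds; drop the final (recall=0) point
eligible = recall[:-1] >= recall_floor
if eligible.any():
    best = np.argmax(np.where(eligible, precision[:-1], -np.inf))
    print(f"threshold={thresholds[best]:.3f} "
          f"precision={precision[best]:.3f} recall={recall[best]:.3f}")
else:
    print("No threshold reaches the required recall floor.")
```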
The concept originates in information retrieval research from the 1970s and 1980s, where precision and recall were standard metrics for evaluating document search systems. It migrated into machine learning evaluation practice during the mid-2000s as large, imbalanced datasets became common in spam filtering, bioinformatics, and computer vision, cementing its role as a standard benchmark tool in modern ML workflows.