A single-score metric summarizing model performance across all decision thresholds of the precision-recall curve.
Average Precision (AP) is an evaluation metric that condenses the precision-recall tradeoff of a ranking or classification model into a single scalar value. Rather than measuring performance at a single operating threshold, AP captures how well a model performs across all possible thresholds by computing the area under the precision-recall curve. This makes it especially valuable when the cost of false positives and false negatives must both be considered, and when the optimal decision threshold is not known in advance.
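Written out, this area can be expressed with the standard notation below, where $p(r)$ is precision as a function of recall and $P_n$, $R_n$ are the precision and recall at the $n$-th ranked prediction (the discrete sum is the form most libraries actually compute):

$$
\mathrm{AP} \;=\; \int_0^1 p(r)\,dr \;\approx\; \sum_n \left(R_n - R_{n-1}\right) P_n
$$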
The calculation works by first ranking all predictions by their confidence scores and then sweeping a decision threshold from high to low confidence. Each time a true positive is encountered, the precision at that point is recorded. AP is the sum of these precision values, each weighted by the corresponding increase in recall, which yields a value between 0 and 1, where 1 represents a perfect ranking that places every true positive ahead of every false positive. In practice, implementations vary slightly: some interpolate precision at fixed recall levels (as in the PASCAL VOC object detection benchmark), while others compute the exact area under the curve.
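A minimal sketch of this step-wise computation, with scikit-learn's `average_precision_score` as a cross-check (the labels and scores are toy values invented for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def average_precision(y_true, y_score):
    """Exact (non-interpolated) AP: precision summed with recall-increment weights."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(-y_score)      # rank predictions from high to low confidence
    y_true = y_true[order]

    tp = np.cumsum(y_true)            # true positives accumulated at each rank
    fp = np.cumsum(1 - y_true)        # false positives accumulated at each rank
    precision = tp / (tp + fp)
    recall = tp / y_true.sum()

    # Weight each precision value by the increase in recall at that rank;
    # only ranks where a true positive appears contribute (recall jumps there).
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))

# Toy example: 3 positives among 8 ranked predictions
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.20]

print(average_precision(y_true, y_score))        # manual computation (~0.722)
print(average_precision_score(y_true, y_score))  # scikit-learn reference
```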
Average Precision is particularly important in domains with severe class imbalance, where accuracy and even F1-score can be misleading. In object detection, for example, a model may need to identify rare objects among thousands of background regions; AP rewards models that consistently rank true detections highly regardless of how many negatives exist. Mean Average Precision (mAP), which averages AP across multiple object categories or IoU thresholds, has become the dominant benchmark metric in computer vision competitions such as COCO and PASCAL VOC.
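As a sketch of how mAP aggregates per-class results, the snippet below averages AP over hypothetical classes; in real detection benchmarks such as COCO, predicted boxes are first matched to ground truth at one or more IoU thresholds before this averaging step, and the per-class labels and scores shown here are invented for illustration:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical per-class results after IoU matching:
# for each class, (is_true_positive labels, confidence scores).
per_class_results = {
    "cat": ([1, 0, 1, 0], [0.90, 0.80, 0.70, 0.30]),
    "dog": ([1, 1, 0, 0], [0.95, 0.60, 0.50, 0.20]),
}

aps = [
    average_precision_score(labels, scores)
    for labels, scores in per_class_results.values()
]
map_score = float(np.mean(aps))  # mAP: unweighted mean of per-class AP values
print(map_score)
```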
Beyond computer vision, AP is widely used in information retrieval, document ranking, and medical diagnosis tasks where ranking quality matters more than binary classification at a fixed cutoff. Its ability to summarize an entire precision-recall curve in one interpretable number makes it a preferred metric in research publications and model selection pipelines, offering a more complete picture of model behavior than threshold-dependent alternatives.