A cutoff value that maps a model's probability output to a discrete class label.
A classification threshold is a scalar cutoff applied to a model's continuous output — typically a probability — to produce a discrete class prediction. In binary classification, a model such as logistic regression or a neural network with a sigmoid output layer produces a score between 0 and 1 representing the estimated probability that an input belongs to the positive class. The threshold, commonly defaulted to 0.5, determines the decision boundary: scores above it are assigned the positive label, scores at or below it receive the negative label. This simple mechanism bridges the gap between probabilistic model outputs and the categorical decisions required in practice.
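The thresholding step described above can be sketched in a few lines. The scores below are invented for illustration, not produced by any real model, and the strictly-greater-than convention follows the text's description:

```python
import numpy as np

# Hypothetical sigmoid outputs for five inputs (invented numbers).
scores = np.array([0.91, 0.42, 0.50, 0.62, 0.08])

def apply_threshold(scores, threshold=0.5):
    """Map continuous scores to 0/1 labels: strictly above -> positive."""
    return (scores > threshold).astype(int)

default = apply_threshold(scores)        # conventional 0.5 cutoff
strict = apply_threshold(scores, 0.7)    # stricter cutoff flags fewer positives
print(default)  # -> [1 0 0 1 0]
print(strict)   # -> [1 0 0 0 0]
```

Note that a score exactly equal to the threshold (0.50 here) receives the negative label under this convention; some libraries instead use greater-than-or-equal, so it is worth checking which rule a given implementation applies.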
The choice of threshold directly governs the trade-off between true positive rate (sensitivity) and false positive rate (equal to one minus specificity), a relationship visualized through the ROC (Receiver Operating Characteristic) curve. Lowering the threshold increases recall — catching more true positives — at the cost of more false positives. Raising it does the opposite. This trade-off is not merely academic: in medical screening, a low threshold minimizes missed diagnoses; in spam filtering, a higher threshold reduces false alarms. The optimal threshold is therefore application-dependent and is often selected by maximizing a composite metric such as the F1 score or by minimizing an asymmetric cost function that weights different error types unequally.
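Selecting a threshold by maximizing F1, as mentioned above, amounts to sweeping candidate cutoffs and scoring each one. A minimal sketch with invented labels and scores (in practice the sweep would run on held-out validation data):

```python
import numpy as np

# Toy ground-truth labels and model scores (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

def f1_at(threshold):
    """Compute F1 for the labels produced by a given cutoff."""
    y_pred = (scores > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Sweep a grid of candidate thresholds and keep the F1-maximizing one.
candidates = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best = max(candidates, key=f1_at)
print(best, round(f1_at(best), 3))
```

The same sweep structure works for any scalar objective: replacing `f1_at` with an asymmetric cost function (weighting false negatives more heavily than false positives, say) yields a cost-minimizing threshold instead.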
Threshold selection is closely tied to model calibration. A well-calibrated model produces probability estimates that accurately reflect empirical frequencies, making threshold choices more interpretable and reliable. Poorly calibrated models may require post-hoc adjustment using techniques like Platt scaling or isotonic regression before threshold tuning is meaningful. Tools such as precision-recall curves and the ROC curve help practitioners visualize performance across all possible thresholds, enabling informed selection rather than reliance on the arbitrary 0.5 default.
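One rough way to inspect calibration, as described above, is to bin predictions and compare each bin's mean predicted probability with its empirical positive rate; for a well-calibrated model the two track closely. The probabilities and outcomes below are invented for illustration:

```python
import numpy as np

# Toy predicted probabilities and observed outcomes (invented numbers).
probs = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y     = np.array([0,   0,   1,   0,   1,   0,   1,   1  ])

# Assign each prediction to a probability bin, then compare the mean
# predicted probability with the fraction of actual positives per bin.
bins = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
idx = np.digitize(probs, bins) - 1
for b in range(len(bins) - 1):
    mask = idx == b
    if mask.any():
        print(f"bin {b}: mean predicted {probs[mask].mean():.2f}, "
              f"empirical positive rate {y[mask].mean():.2f}")
```

This is the calculation behind a reliability diagram (calibration curve); large gaps between the two columns suggest the model's scores need post-hoc adjustment, such as the Platt scaling or isotonic regression mentioned above, before threshold tuning is meaningful.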
Beyond binary classification, the concept extends to multi-class settings through one-vs-rest schemes, where a separate threshold governs each class, and to anomaly detection, where a threshold on an anomaly score separates normal from aberrant instances. As ML systems are increasingly deployed in high-stakes domains, deliberate threshold selection has become a core component of responsible model deployment, directly influencing fairness, safety, and operational cost.
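The one-vs-rest scheme above can be sketched as applying a per-class threshold vector to a score matrix. The scores and thresholds here are hypothetical; in practice each threshold would be tuned independently on validation data:

```python
import numpy as np

# Toy per-class scores from a hypothetical one-vs-rest classifier:
# rows are samples, columns are classes A, B, C (invented numbers).
scores = np.array([
    [0.80, 0.10, 0.30],
    [0.20, 0.65, 0.40],
    [0.05, 0.30, 0.90],
])

# One independently chosen threshold per class.
thresholds = np.array([0.50, 0.60, 0.70])

# Broadcasting compares each column against its own cutoff; a sample
# may trigger zero, one, or several class flags.
flags = (scores > thresholds).astype(int)
print(flags)
```

Because each class has its own cutoff, this setup naturally handles multi-label outputs; for single-label prediction the flags are typically resolved by picking the highest-scoring class among those exceeding their thresholds.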