A metric measuring overlap between predicted and ground-truth bounding boxes in object detection.
Intersection over Union (IoU) is a standard evaluation metric in computer vision that quantifies how well a predicted bounding box aligns with a ground-truth bounding box. It is computed by dividing the area of overlap between the two boxes by the area of their union — producing a value between 0 and 1, where 0 indicates no overlap and 1 indicates a perfect match. This simple geometric ratio gives practitioners an intuitive, threshold-based way to decide whether a detection should count as a true positive or a false positive.
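The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions chosen for clarity.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 (corner format is an illustrative choice).
    """
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two unit-overlap boxes such as `(0, 0, 2, 2)` and `(1, 1, 3, 3)` share 1 unit of area out of a union of 7, giving an IoU of 1/7.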
In practice, IoU is applied by setting a threshold — commonly 0.5 — above which a predicted box is considered a correct detection. This threshold-based judgment feeds directly into precision and recall calculations, and ultimately into the mean Average Precision (mAP) scores that dominate object detection leaderboards. Datasets and benchmarks such as PASCAL VOC and MS COCO popularized IoU as the canonical evaluation protocol — COCO's primary metric goes further, averaging AP over IoU thresholds from 0.50 to 0.95 — and their widespread adoption cemented it as the community standard throughout the 2010s.
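The thresholding step can be sketched as a greedy matching pass: predictions are visited in order of confidence, each claims at most one unused ground-truth box, and a match at or above the threshold counts as a true positive. This is a simplified sketch of the evaluation convention, not any benchmark's exact implementation; function names and the box format are illustrative.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes (helper for the sketch below)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, thresh=0.5):
    """Label each prediction "TP" or "FP" at an IoU threshold.

    preds: list of (score, box) pairs, sorted by descending confidence.
    gts:   list of ground-truth boxes; each may match at most one prediction.
    """
    matched = set()
    labels = []
    for _score, box in preds:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = _iou(box, gt)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_iou >= thresh:
            matched.add(best_j)  # this ground truth is now consumed
            labels.append("TP")
        else:
            labels.append("FP")  # low overlap, or a duplicate detection
    return labels
```

Note that a second, lower-confidence detection of an already-matched object is counted as a false positive, which is why the TP/FP labels feed directly into precision and recall.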
Beyond evaluation, IoU has become embedded in the training process itself. Loss functions such as GIoU (Generalized IoU), DIoU, and CIoU extend the basic overlap ratio to provide smoother gradients and better geometric guidance during optimization, addressing cases where standard IoU produces zero gradient for non-overlapping boxes. These IoU-based losses have improved convergence and localization accuracy in modern detectors like YOLO variants and DETR.
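To make the zero-gradient problem concrete, here is a sketch of the GIoU loss (1 − GIoU), where GIoU subtracts a penalty based on the smallest box enclosing both inputs. Unlike plain 1 − IoU, it still varies as two disjoint boxes move apart. The box format and function name are assumptions for illustration, and degenerate zero-area boxes are not handled.

```python
def giou_loss(box_a, box_b):
    """1 - GIoU for two (x1, y1, x2, y2) boxes with positive area.

    GIoU = IoU - (|C| - |union|) / |C|, where C is the smallest
    axis-aligned box enclosing both inputs. Range of the loss: [0, 2].
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C: its empty space penalizes distant boxes,
    # so the loss keeps changing even when IoU is stuck at zero.
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c_area = cw * ch
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```

For two non-overlapping unit boxes `(0, 0, 1, 1)` and `(2, 0, 3, 1)`, plain IoU is 0 no matter how far apart they sit, whereas the GIoU loss here is 4/3 and grows as the gap widens — exactly the geometric guidance the text describes. DIoU and CIoU refine this further with center-distance and aspect-ratio terms.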
IoU also generalizes to segmentation tasks, where it is applied pixel-wise rather than box-wise — often called the Jaccard index — to measure how well a predicted segmentation mask covers the ground-truth region. Whether used for bounding boxes or masks, IoU's appeal lies in its geometric interpretability and its direct correspondence to the localization quality that matters in real-world applications such as autonomous driving, medical imaging, and surveillance. Its simplicity has made it one of the most durable and universally adopted metrics in all of computer vision.
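The pixel-wise variant is the same ratio with set membership replacing box geometry: count pixels where both masks are on, divide by pixels where either is on. A minimal pure-Python sketch, assuming masks are equal-shaped nested lists (or other row iterables) of 0/1 or boolean values:

```python
def mask_iou(mask_a, mask_b):
    """Pixel-wise IoU (Jaccard index) of two equal-shaped binary masks."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

For instance, masks `[[1, 1], [0, 0]]` and `[[1, 0], [1, 0]]` agree on one pixel out of three covered in total, so their IoU is 1/3.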