Classifying every pixel in an image into a meaningful object category.
Semantic segmentation is a computer vision task in which every pixel of an image is assigned a class label — such as "road," "sky," "pedestrian," or "building" — producing a dense, pixel-level understanding of a scene. Unlike image classification, which assigns a single label to an entire image, or object detection, which draws bounding boxes around objects, semantic segmentation provides a precise spatial map of what occupies every location in the image. This granularity makes it one of the most demanding and informative forms of visual perception.
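To make the distinction concrete, the toy NumPy sketch below contrasts a single image-level label with a dense per-pixel label map. It is illustrative only; the array shapes and names are assumptions, not any particular library's API:

```python
import numpy as np

# A classifier outputs one label for the whole image;
# a segmentation model outputs one label per pixel.
H, W, NUM_CLASSES = 4, 6, 3  # tiny toy dimensions

# Per-pixel class scores (logits), shape (H, W, num_classes).
logits = np.random.rand(H, W, NUM_CLASSES)

# Dense prediction: an integer class index at every pixel.
label_map = logits.argmax(axis=-1)  # shape (H, W)
print(label_map.shape)              # (4, 6)

# By contrast, image classification collapses the same scores
# into a single label for the entire image.
image_label = logits.mean(axis=(0, 1)).argmax()
```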
Modern semantic segmentation relies heavily on deep learning, particularly convolutional neural networks (CNNs) with encoder-decoder architectures. The encoder progressively downsamples the input image to extract high-level semantic features, while the decoder upsamples those features back to the original resolution to produce per-pixel predictions. A landmark advance came with Fully Convolutional Networks (FCN) in 2015, which replaced the fully connected layers of classification networks with convolutional layers, enabling end-to-end pixel-wise prediction. Subsequent architectures like DeepLab introduced dilated (atrous) convolutions and conditional random fields to capture multi-scale context without sacrificing resolution, while U-Net became the dominant approach in medical imaging by using skip connections to preserve fine spatial detail.
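As a concrete illustration, here is a minimal PyTorch sketch combining the three ideas above: an encoder that downsamples, a dilated convolution for wider context, and a skip connection that feeds encoder features to the decoder. It is a toy model for exposition, not FCN, DeepLab, or U-Net themselves; the class name and layer sizes are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegNet(nn.Module):
    """Toy encoder-decoder with one U-Net-style skip connection."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Encoder: downsample while widening the channel dimension.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Dilated (atrous) conv: larger receptive field, no further downsampling.
        self.context = nn.Conv2d(64, 64, 3, padding=2, dilation=2)
        # Decoder: upsample back toward the input resolution.
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        # Head sees decoder features concatenated with encoder features.
        self.head = nn.Conv2d(32 + 32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                         # (B, 32, H, W)
        e2 = F.relu(self.context(self.enc2(e1))) # (B, 64, H/2, W/2)
        d = self.up(e2)                           # (B, 32, H, W)
        d = torch.cat([d, e1], dim=1)             # skip connection preserves detail
        return self.head(d)                       # per-pixel logits (B, C, H, W)

logits = TinySegNet(num_classes=5)(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 5, 64, 64])
```

The output has one score per class at every pixel, so a per-pixel prediction is just an argmax over the channel dimension.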
The practical importance of semantic segmentation spans numerous high-stakes domains. In autonomous driving, it enables vehicles to distinguish drivable road surface from sidewalks, obstacles, and lane markings in real time. In medical imaging, it supports the precise delineation of tumors, organs, and tissue boundaries. In satellite and aerial imagery analysis, it powers land-use classification and environmental monitoring at scale. Augmented reality systems use it to separate foreground subjects from backgrounds for realistic scene compositing.
Training semantic segmentation models requires large datasets of images with dense pixel-level annotations, which are expensive and time-consuming to produce. Benchmarks such as PASCAL VOC, Cityscapes, and ADE20K have driven progress by providing standardized evaluation. More recently, transformer-based architectures like SegFormer and Mask2Former have pushed state-of-the-art performance further by capturing long-range spatial dependencies that CNNs struggle to model. At the same time, semi-supervised and self-supervised approaches are reducing the reliance on costly labeled data.
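Evaluation on these benchmarks is typically reported as mean intersection-over-union (mIoU): the overlap between predicted and ground-truth masks for each class, averaged across classes. A minimal NumPy sketch of the metric follows; the function name and the toy arrays are illustrative:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over per-class binary masks."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0, 1], [1, 2, 2]])
target = np.array([[0, 1, 1], [1, 2, 2]])
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.722
```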