A deep learning architecture that learns spatial hierarchies of features from visual data.
A Convolutional Neural Network (CNN) is a class of deep neural networks specifically designed to process data with a grid-like topology, most commonly images and video. Unlike fully connected networks that treat every input feature independently, CNNs exploit the spatial structure of visual data by applying learned filters across local regions of the input. Because the same filter weights are reused at every spatial position, this design dramatically reduces the number of parameters compared to dense networks and encodes a powerful inductive bias: that meaningful patterns can appear anywhere in an image and that nearby pixels are more related than distant ones.
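To make the filter-sliding idea concrete, here is a minimal sketch of the "valid" cross-correlation that deep learning frameworks call convolution, written in plain NumPy. The image, the Sobel-style vertical-edge kernel, and the `conv2d` helper name are illustrative choices, not part of any particular library:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation (the 'convolution' used in CNNs):
    slide the kernel over every local patch and take a dot product."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter responds wherever intensity changes left to right.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)
print(conv2d(image, sobel_x))  # strong responses along the vertical edge
```

Note that the kernel has only nine learnable weights no matter how large the input image is, which is exactly the parameter saving that filter sharing provides over a dense layer.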
The core building block of a CNN is the convolutional layer, where a set of learnable filters slides across the input to produce feature maps that highlight the presence of specific patterns—edges, textures, or more complex shapes in deeper layers. These are typically followed by nonlinear activation functions and pooling layers, which downsample feature maps to reduce spatial resolution and provide a degree of translation invariance. Stacking many such layers creates a hierarchy where early layers detect low-level features and later layers compose them into increasingly abstract representations, ultimately enabling the network to distinguish between complex categories like faces, objects, or medical anomalies.
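The convolution–activation–pooling sequence described above can be sketched as a single layer in NumPy. This is a toy illustration under assumed shapes (an 8×8 single-channel input and a 3×3 filter); the helper names and the random inputs are inventions for the example, not any framework's API:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid cross-correlation: dot product of the kernel with each local patch.
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # Elementwise nonlinearity: keep positive responses, zero out the rest.
    return np.maximum(x, 0.0)

def max_pool2d(fmap, size=2):
    # Non-overlapping max pooling: keep the strongest response per window,
    # halving spatial resolution (for size=2) and adding some shift tolerance.
    h = (fmap.shape[0] // size) * size
    w = (fmap.shape[1] // size) * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# One conv "layer": convolve, apply the nonlinearity, then downsample.
rng = np.random.default_rng(0)
image = rng.random((8, 8))                  # toy single-channel input
kernel = rng.standard_normal((3, 3))        # one learnable filter
feature_map = relu(conv2d(image, kernel))   # shape (6, 6)
pooled = max_pool2d(feature_map)            # shape (3, 3)
print(feature_map.shape, pooled.shape)
```

Stacking several such layers, each with many filters, is what produces the feature hierarchy: the pooled output of one layer becomes the input grid for the next.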
CNNs became practically influential in machine learning with Yann LeCun's LeNet-5 in 1998, which demonstrated reliable handwritten digit recognition. The field was transformed in 2012 when AlexNet, trained on GPUs with large-scale data, won the ImageNet competition by a wide margin, triggering the modern deep learning era in computer vision. Subsequent architectures—VGGNet, GoogLeNet, ResNet, and EfficientNet—refined depth, width, and connectivity patterns to push accuracy and efficiency further.
Beyond image classification, CNNs underpin object detection, semantic segmentation, medical image analysis, autonomous driving perception, and even natural language processing tasks where local feature extraction is valuable. While transformer-based vision models have emerged as strong competitors, CNNs remain widely deployed due to their computational efficiency, well-understood behavior, and strong inductive biases that make them highly effective with limited data.