A downsampling operation that retains the maximum value within each local region.
Max pooling is a spatial downsampling operation used primarily in convolutional neural networks (CNNs) to reduce the height and width of feature maps while preserving the most prominent activations. The operation works by partitioning the input into non-overlapping (or sometimes overlapping) rectangular regions and outputting the maximum value from each region. A 2×2 max pooling layer with stride 2, for example, reduces each spatial dimension by half, cutting the total number of activations to one quarter of the original.
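The mechanics described above can be sketched directly. The snippet below is a minimal NumPy illustration (not an optimized or library implementation); the function name `max_pool2d` and the example feature map are chosen here for demonstration:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    # Slide a pool x pool window over the 2-D input, keeping the max of each region.
    h, w = x.shape
    out_h = (h - pool) // stride + 1
    out_w = (w - pool) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * stride:i * stride + pool, j * stride:j * stride + pool]
            out[i, j] = region.max()
    return out

# A 4x4 feature map pools down to 2x2: one quarter of the activations survive.
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 8, 3],
                 [4, 9, 1, 5]])
print(max_pool2d(fmap))  # each output entry is the max of one 2x2 window
```

With a 2×2 window and stride 2, each output value is the maximum of one non-overlapping 2×2 region, so the 4×4 input becomes a 2×2 output.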
Beyond simple compression, max pooling serves several important functional roles. By retaining only the strongest activation in each local region, it introduces a degree of translational invariance — small shifts in the position of a feature in the input produce little or no change in the pooled output. This property helps networks generalize across slight variations in object position, scale, or distortion. Max pooling also acts as a form of feature selection, discarding weaker activations and concentrating information from the most salient detected patterns, which can reduce overfitting and improve computational efficiency in subsequent layers.
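The invariance property is easy to demonstrate: if a strong activation shifts but stays inside the same pooling window, the pooled output is unchanged. A small sketch (the inputs here are contrived for illustration):

```python
import numpy as np

def max_pool2d(x, pool=2):
    # Vectorized non-overlapping max pooling via reshape.
    h, w = x.shape
    return x.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 9.0   # a feature detected at position (0, 0)
b = np.zeros((4, 4)); b[1, 1] = 9.0   # the same feature shifted to (1, 1)

# Both shifts land in the same 2x2 window, so the pooled maps are identical.
print(np.array_equal(max_pool2d(a), max_pool2d(b)))
```

A shift that crosses a window boundary would change the output, which is why the invariance is only to small translations.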
Max pooling became a standard architectural component following the success of AlexNet in the 2012 ImageNet Large Scale Visual Recognition Challenge, where its use in a deep CNN demonstrated state-of-the-art performance on large-scale image classification. Since then, it has appeared in nearly every major CNN architecture, including VGGNet, GoogLeNet, and ResNet. Its simplicity — no learnable parameters, straightforward gradient computation during backpropagation — makes it easy to integrate and computationally cheap relative to alternatives.
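The "straightforward gradient computation" mentioned above works by routing each upstream gradient entirely to the element that produced the window's maximum; all other inputs receive zero. A hedged sketch of this backward rule, assuming non-overlapping windows (the function name and shapes are illustrative, not from any particular framework):

```python
import numpy as np

def max_pool2d_backward(x, grad_out, pool=2):
    # Send each upstream gradient to the argmax position of its pooling window.
    grad_in = np.zeros_like(x, dtype=float)
    out_h, out_w = grad_out.shape
    for i in range(out_h):
        for j in range(out_w):
            win = x[i * pool:(i + 1) * pool, j * pool:(j + 1) * pool]
            r, c = np.unravel_index(win.argmax(), win.shape)
            grad_in[i * pool + r, j * pool + c] = grad_out[i, j]
    return grad_in

x = np.array([[1., 3.],
              [2., 0.]])
# One pooled output, gradient 1.0: it flows only to the 3 at position (0, 1).
print(max_pool2d_backward(x, np.array([[1.0]])))
```

Since there are no weights to update, the backward pass is purely this routing step, which is part of why the layer is so cheap.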
In recent years, max pooling has faced competition from alternative approaches. Strided convolutions can learn to downsample in a data-driven way, and global average pooling is often preferred before classification heads for its regularization benefits. Vision Transformers largely bypass spatial pooling altogether. Nevertheless, max pooling remains widely used and is a foundational concept for understanding how CNNs manage spatial hierarchy and feature abstraction.
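To make the contrast with global average pooling concrete, here is a minimal sketch: global average pooling collapses each feature map to a single scalar, while max pooling merely shrinks its spatial extent (the example map is arbitrary):

```python
import numpy as np

fmap = np.arange(16.0).reshape(4, 4)  # one 4x4 feature map (per-channel view)

# Global average pooling: the whole map reduces to one scalar per channel.
gap = fmap.mean()

# 2x2 max pooling: the map shrinks to 2x2 but keeps spatial layout.
mp = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(gap, mp.shape)  # a single scalar vs. a smaller 2x2 map
```

This is why global average pooling pairs naturally with classification heads (one value per channel feeds the final linear layer), whereas max pooling is used between convolutional stages where spatial structure must be preserved.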