The step size by which a convolutional filter moves across an input during convolution.
In convolutional neural networks (CNNs), stride length is a hyperparameter that controls how many positions a convolutional filter shifts at each step as it slides across an input image or feature map. A stride of 1 means the filter moves one pixel at a time, producing maximum overlap between adjacent receptive fields. A stride of 2 means the filter jumps two pixels at a time, skipping intermediate positions. This seemingly simple parameter has significant downstream effects on the shape and content of the resulting feature maps.
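The sliding behavior can be illustrated with a minimal one-dimensional sketch (pure Python, no framework; the 2D case is analogous, and the name `conv1d` is just illustrative). With stride 1 the kernel is evaluated at every valid position; with stride 2 it skips every other one:

```python
def conv1d(signal, kernel, stride):
    """Valid 1D convolution (cross-correlation) with the given stride.

    The kernel starts at index 0 and advances by `stride` positions,
    stopping while it still fits entirely inside the signal.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

x = [1, 2, 3, 4, 5, 6]
edge = [1, 0, -1]  # simple difference kernel

conv1d(x, edge, 1)  # -> [-2, -2, -2, -2]  (every position, maximum overlap)
conv1d(x, edge, 2)  # -> [-2, -2]          (every other position skipped)
```

Doubling the stride halves the number of output positions along that dimension, which is exactly the downsampling effect described below.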
The relationship between stride and output dimensions is straightforward: larger strides produce smaller output feature maps. For an input of spatial size W along one dimension, a filter of size F, padding P, and stride S, the output dimension is ⌊(W − F + 2P) / S⌋ + 1 (applied independently to height and width). This means stride acts as a form of spatial downsampling, reducing the resolution of feature maps as information flows deeper into the network. In this sense, strided convolutions can serve a similar function to pooling layers, and many modern architectures use strided convolutions in place of explicit pooling to reduce spatial dimensions while learning the downsampling operation rather than applying a fixed one.
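The formula above can be computed directly. A small sketch (the helper name `conv_output_size` is illustrative, not from any particular library):

```python
def conv_output_size(w, f, p, s):
    """Output spatial dimension: floor((w - f + 2p) / s) + 1.

    w: input size, f: filter size, p: padding, s: stride.
    Python's // gives the floor for these non-negative values.
    """
    return (w - f + 2 * p) // s + 1

# 3x3 filter with padding 1 ("same" padding) at stride 1 preserves resolution:
conv_output_size(32, 3, 1, 1)  # -> 32
# The same filter at stride 2 halves it:
conv_output_size(32, 3, 1, 2)  # -> 16
# AlexNet's first layer (11x11 filter, padding 2, stride 4) on a 224px input:
conv_output_size(224, 11, 2, 4)  # -> 55
```

Note the floor: when (W − F + 2P) is not evenly divisible by S, the filter's last partial step is simply dropped.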
Stride length directly governs the trade-off between spatial resolution and computational efficiency. A stride of 1 preserves fine-grained spatial detail but requires more computation and produces larger intermediate representations. Larger strides reduce memory and compute requirements but may discard spatially precise information. In practice, strides of 1 and 2 are most common — stride 1 for layers where preserving resolution matters, and stride 2 for intentional downsampling. The choice of stride interacts closely with filter size, padding strategy, and network depth to determine the overall receptive field and representational capacity of the model.
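The compute savings compound across both spatial dimensions: halving the resolution in height and width quarters the number of filter applications. A rough sketch of the arithmetic (single channel, single filter; `conv2d_macs` is a hypothetical helper counting multiply-accumulate operations, ignoring channels and batch size):

```python
def conv2d_macs(w, h, f, s, p=0):
    """Multiply-accumulates for one f x f filter over a w x h input.

    Each output position costs f*f multiplies; the number of output
    positions per dimension follows the standard output-size formula.
    """
    out_w = (w - f + 2 * p) // s + 1
    out_h = (h - f + 2 * p) // s + 1
    return out_w * out_h * f * f

# A 3x3 filter on a 32x32 input with padding 1:
conv2d_macs(32, 32, 3, 1, 1)  # -> 9216 (stride 1)
conv2d_macs(32, 32, 3, 2, 1)  # -> 2304 (stride 2: 4x cheaper)
```

This 4x factor is why stride 2 is the standard choice for intentional downsampling stages: it buys a quadratic reduction in both compute and activation memory at the cost of spatial precision.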
Stride became a central design consideration with the rise of deep CNN architectures in the early 2010s. AlexNet (2012) notably used a stride of 4 in its first convolutional layer to aggressively reduce spatial dimensions from large input images, a design choice that influenced subsequent architectures. Later work, including the VGG and ResNet families, refined stride usage as a principled tool for controlling information flow through deep networks.