A regularization technique that randomly deactivates neurons during training to prevent overfitting.
Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each forward pass of training. Rather than always propagating signals through every unit, the network temporarily "drops" a randomly selected subset of neurons — typically each with a probability between 0.2 and 0.5 — forcing the remaining units to compensate. At inference time, all neurons are active; in the original formulation their outputs are scaled by the keep probability, so that the expected activation magnitude each unit produces matches what downstream units saw during training.
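The paragraph above describes the original formulation, which rescales at test time; most modern frameworks instead use the mathematically equivalent "inverted" variant, which scales the surviving activations up by 1/(1 − p) during training so that inference needs no adjustment. A minimal NumPy sketch of inverted dropout (the function name and signature are illustrative, not from any particular library):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p during
    training and scale survivors by 1/(1-p), so the expected value of each
    activation is unchanged and inference requires no rescaling."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)       # scale up the surviving units
```

With p = 0.5, every entry of the output is either 0 or exactly twice its input value, and at inference (`training=False`) the input passes through untouched.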
The core intuition behind dropout is that it prevents neurons from co-adapting too closely to one another. When any given neuron can be absent at any training step, the network cannot rely on specific combinations of neurons to encode a pattern and must instead learn more distributed, redundant representations. Training with dropout is loosely analogous to training an ensemble of exponentially many "thinned" sub-networks that share weights, then averaging their predictions at test time — a perspective that helps explain why dropout so reliably improves generalization.
Dropout proved especially impactful in the deep learning era, where large networks with millions of parameters are highly susceptible to memorizing training data. Its introduction coincided with the rise of convolutional and recurrent architectures, and it became a standard component in models achieving state-of-the-art results across image recognition, speech recognition, and natural language processing. Variants such as spatial dropout (dropping entire feature maps in CNNs) and variational dropout (connecting dropout to Bayesian inference) have since extended the original idea to more specialized settings.
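To make the spatial-dropout variant mentioned above concrete, the sketch below drops entire feature maps of a CNN activation tensor rather than individual elements. This is a hedged illustration, not any framework's actual implementation; the function name and the (batch, channels, height, width) layout are assumptions chosen for clarity:

```python
import numpy as np

def spatial_dropout(x, p=0.2, training=True, rng=None):
    """Spatial dropout for a (batch, channels, height, width) tensor: makes
    one keep/drop decision per (sample, channel) and broadcasts it over the
    spatial dimensions, since neighboring pixels within a feature map are
    strongly correlated and element-wise dropout would barely regularize."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    n, c = x.shape[0], x.shape[1]
    # Shape (n, c, 1, 1) broadcasts the per-channel mask across H and W.
    mask = (rng.random((n, c, 1, 1)) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)   # inverted-dropout scaling, as before
```

Because the mask is constant within each channel, every feature map in the output is either entirely zero or uniformly scaled by 1/(1 − p).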
While newer architectures — particularly transformers — often rely more heavily on other regularization strategies like weight decay and layer normalization, dropout remains widely used and is still a default tool in practitioners' toolkits. Its simplicity, low computational overhead, and consistent empirical benefits have cemented it as one of the most influential ideas in modern deep learning.