Shortcut connections in deep networks that enable training of much deeper architectures.
Residual connections are architectural shortcuts in deep neural networks that bypass one or more layers by adding the input of those layers directly to their output. Formally, instead of learning a direct mapping H(x), the bypassed layers learn a residual function F(x) = H(x) − x, so the full output becomes F(x) + x. This seemingly simple modification has profound consequences: because the derivative of F(x) + x with respect to x contains an identity term, gradients can flow backward through the additive shortcut without passing through potentially saturating nonlinearities, which greatly mitigates the vanishing gradient problem that had long prevented practitioners from training very deep networks.
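The effect of the additive shortcut on gradient flow can be illustrated numerically. The following is a minimal NumPy sketch, not a reference implementation: the tanh branch, the layer width, the depth, and the weight scale are all assumptions chosen for illustration, and the helper names are hypothetical. It multiplies the per-layer Jacobians of a plain stack and of a residual stack and compares the size of the resulting end-to-end Jacobian.

```python
import numpy as np

def branch(x, W):
    """Residual branch F(x); a single tanh layer is assumed here."""
    return np.tanh(W @ x)

def branch_jacobian(x, W):
    """Jacobian of tanh(W x) with respect to x: diag(1 - tanh^2(W x)) @ W."""
    pre = W @ x
    return (1.0 - np.tanh(pre) ** 2)[:, None] * W

def end_to_end_jacobian_norm(x, weights, residual):
    """Frobenius norm of d(output)/d(input) for a stack of layers."""
    J = np.eye(len(x))
    for W in weights:
        J_layer = branch_jacobian(x, W)
        if residual:
            # y = x + F(x), so dy/dx = I + dF/dx (the shortcut adds the identity)
            J = (np.eye(len(x)) + J_layer) @ J
            x = x + branch(x, W)
        else:
            # y = F(x), so dy/dx = dF/dx only
            J = J_layer @ J
            x = branch(x, W)
    return np.linalg.norm(J)

# Illustrative settings (assumptions): 50 layers of width 16, small random weights.
rng = np.random.default_rng(0)
dim, depth = 16, 50
weights = [rng.normal(scale=0.05, size=(dim, dim)) for _ in range(depth)]
x0 = rng.normal(size=dim)

plain_norm = end_to_end_jacobian_norm(x0, weights, residual=False)
resid_norm = end_to_end_jacobian_norm(x0, weights, residual=True)
print(f"plain stack:    {plain_norm:.3e}")
print(f"residual stack: {resid_norm:.3e}")
```

With these settings the plain stack's Jacobian shrinks toward zero as depth grows, while the identity term keeps the residual stack's Jacobian at a usable magnitude. The exact numbers depend on the assumed initialization and should not be read as typical of trained networks.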
The practical impact is substantial. Before residual connections, plain networks deeper than roughly 20–30 layers would often perform worse than shallower counterparts. The cause was not overfitting but optimization difficulty: the deeper networks reached higher error even on the training set. Residual connections broke this barrier, enabling the training of networks with hundreds or even thousands of layers. The original ResNet architecture won the ILSVRC 2015 ImageNet classification competition with a 152-layer model, a depth that would have been very difficult to train effectively with conventional feedforward designs.
Residual connections have proven remarkably general beyond image classification. They are a foundational component of the Transformer architecture, where they wrap every attention and feed-forward sublayer and stabilize the training of very deep language models. Shortcut connections of one form or another appear in nearly every state-of-the-art deep learning architecture developed since 2015, though not always as a plain addition. U-Nets for image segmentation pass encoder features across to the decoder, typically by concatenation. Highway networks use gated shortcuts, which learn how much of the shortcut to pass through. Dense networks use dense connections, which feed each layer the concatenated outputs of all preceding layers in a block.
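The three shortcut styles mentioned above differ mainly in how the carried signal is combined with the transformed one. The sketch below contrasts them in NumPy under stated assumptions: `make_layer` is a hypothetical helper producing a random tanh layer, the gated variant follows the highway-style form t * F(x) + (1 − t) * x, and the dense variant assumes concatenation of all earlier features. None of this is taken from a specific library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_layer(in_dim, out_dim, rng):
    """Hypothetical helper: a random tanh layer standing in for F."""
    W = rng.normal(scale=0.1, size=(out_dim, in_dim))
    return lambda v: np.tanh(W @ v)

def additive_shortcut(x, f):
    """Plain residual connection: output = x + F(x)."""
    return x + f(x)

def gated_shortcut(x, f, W_gate, b_gate):
    """Highway-style gate (assumed form): t blends F(x) with the carried input."""
    t = sigmoid(W_gate @ x + b_gate)
    return t * f(x) + (1.0 - t) * x

def dense_block(x, layers):
    """Dense connections (assumed form): each layer sees all earlier features."""
    features = [x]
    for f in layers:
        features.append(f(np.concatenate(features)))
    return np.concatenate(features)

rng = np.random.default_rng(0)
dim = 4
x = rng.normal(size=dim)
f = make_layer(dim, dim, rng)

y_add = additive_shortcut(x, f)

# A strongly negative gate bias closes the gate, so the input is carried through.
W_gate = rng.normal(scale=0.1, size=(dim, dim))
y_closed = gated_shortcut(x, f, W_gate, b_gate=np.full(dim, -20.0))

# Layer k in the dense block receives dim * (k + 1) concatenated inputs.
dense_layers = [make_layer(dim * (k + 1), dim, rng) for k in range(3)]
y_dense = dense_block(x, dense_layers)

print(y_add.shape, y_closed.shape, y_dense.shape)
```

The additive form keeps the feature width fixed and adds no parameters, the gated form adds a learned gate per shortcut, and the dense form grows the feature width with every layer, which is one reason the additive variant is the most widely reused.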
Residual connections are also understood to smooth the loss landscape, making it less chaotic and easier for gradient-based optimizers to navigate. They provide an implicit ensemble effect as well: a deep residual network can be interpreted as a collection of many shorter paths through the network, each contributing to the final prediction. Residual networks are also robust to layer removal, degrading gracefully when individual layers are dropped at test time. This robustness supports the ensemble interpretation and underscores why residual connections have become a near-universal design pattern in modern deep learning.
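The path interpretation can be checked exactly in a simplified setting. The sketch below assumes purely linear residual branches (a simplification: real networks have nonlinear branches, for which the decomposition is only a conceptual view), and the function names are hypothetical. With n blocks, the output equals the sum over all 2^n paths, where each path either takes or skips each branch. Deleting a block removes exactly the paths that pass through it and leaves the rest intact.

```python
import itertools
import numpy as np

def forward(x, branches, skip=()):
    """Run a stack of linear residual blocks; blocks listed in `skip` are deleted."""
    for i, A in enumerate(branches):
        if i in skip:
            continue  # deleted block: only the identity shortcut remains
        x = x + A @ x
    return x

def sum_over_paths(x, branches, skip=()):
    """Sum the contributions of every path that avoids the deleted blocks."""
    total = np.zeros_like(x)
    for mask in itertools.product([0, 1], repeat=len(branches)):
        if any(mask[i] for i in skip):
            continue  # this path passes through a deleted block
        h = x
        for A, use in zip(branches, mask):
            if use:
                h = A @ h  # take the branch; otherwise take the shortcut
        total = total + h
    return total

# Illustrative settings (assumptions): 3 blocks of width 4, small random weights.
rng = np.random.default_rng(0)
dim, n_blocks = 4, 3
branches = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_blocks)]
x = rng.normal(size=dim)

full = forward(x, branches)
as_paths = sum_over_paths(x, branches)
print(np.allclose(full, as_paths))

dropped = forward(x, branches, skip={1})
dropped_paths = sum_over_paths(x, branches, skip={1})
print(np.allclose(dropped, dropped_paths))
```

In this linear setting, dropping one of three blocks leaves four of the eight paths, including the pure identity path. That gives an intuition for why deleting a single block changes the output only partially instead of breaking the network.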