An activation function that outputs its input if positive, otherwise zero.
ReLU (Rectified Linear Unit) is an activation function defined as f(x) = max(0, x): it passes positive values through unchanged and maps all negative values to zero. This simple operation supplies the non-linearity that neural networks need to learn complex, real-world patterns. Without a non-linear activation, a stack of linear layers is mathematically equivalent to a single linear transformation, no matter how many layers it has. ReLU is applied element-wise after the linear transformations within a network. Because it is piecewise linear, its derivative is trivial to compute (1 for x > 0, 0 for x < 0); it is undefined only at x = 0, where implementations pick a value by convention, commonly 0. This cheap, well-behaved derivative suits backpropagation-based training.
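As a minimal sketch of the definition, the following plain-Python snippet applies ReLU element-wise and checks the point about stacked linear layers using scalar "layers". The function names, the weights w1, b1, w2, b2, and the choice of 0 for the derivative at x = 0 are illustrative assumptions, not taken from any particular library.

```python
def relu(x: float) -> float:
    """Return max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


# Element-wise application to a vector of pre-activations.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
activations = [relu(v) for v in pre_activations]
gradients = [relu_grad(v) for v in pre_activations]
print(activations)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(gradients)    # [0.0, 0.0, 0.0, 1.0, 1.0]

# Two scalar "layers" with hypothetical weights and biases.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5


def stacked_linear(x: float) -> float:
    """Two linear layers with no activation in between."""
    return w2 * (w1 * x + b1) + b2


def collapsed_linear(x: float) -> float:
    """The single linear layer that the stack above reduces to."""
    return (w2 * w1) * x + (w2 * b1 + b2)


def stacked_with_relu(x: float) -> float:
    """The same two layers with ReLU applied after the first."""
    return w2 * relu(w1 * x + b1) + b2


xs = [-1.0, 0.0, 1.0, 2.0]
same_without_relu = all(stacked_linear(x) == collapsed_linear(x) for x in xs)
differs_with_relu = any(stacked_with_relu(x) != collapsed_linear(x) for x in xs)
print(same_without_relu, differs_with_relu)  # True True
```

Without the activation, the two layers agree with one collapsed layer at every test point; inserting ReLU breaks that equivalence, which is what gives depth its expressive value.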
Before ReLU's widespread adoption, sigmoid and tanh dominated as activation choices. These saturating functions compress their outputs into bounded ranges, and their derivatives are small (at most 0.25 for sigmoid) and approach zero for large-magnitude inputs. Backpropagation multiplies these derivatives layer by layer, so gradients can shrink exponentially with depth: the so-called vanishing gradient problem. ReLU avoids this saturation for positive pre-activations, where its gradient is a constant 1, so the activation itself does not attenuate the gradient. This property was a key enabler of the deep learning revolution, helping make it practical to train much deeper networks; networks with hundreds of layers additionally rely on later techniques such as residual connections and normalization layers.
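The effect of depth on the gradient can be illustrated with a deliberately simplified sketch: it multiplies only the activation derivatives across layers and ignores the weight matrices, which in a real network also scale the gradient. The helper chained_gradient and the depth of 20 are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU (0 chosen at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, pre_activation: float, depth: int) -> float:
    """Product of `depth` identical activation derivatives (weights ignored)."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad(pre_activation)
    return g


depth = 20
# Best case for sigmoid: every layer sits at z = 0, where the derivative peaks.
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)
# ReLU with a positive pre-activation at every layer.
relu_chain = chained_gradient(relu_grad, 1.0, depth)

print(sigmoid_chain)  # 0.25 ** 20, roughly 9.1e-13
print(relu_chain)     # 1.0
```

Even in sigmoid's most favorable case, the factor contributed by the activations alone falls below one trillionth after 20 layers, while the ReLU factor stays at 1 along any path of active units.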
ReLU also induces sparsity in network activations: because every negative pre-activation is zeroed out, only a fraction of neurons produce a non-zero output for any given input. This sparse representation can be computationally efficient and has been argued to yield representations that are easier to disentangle and separate. However, ReLU is not without drawbacks. The "dying ReLU" problem occurs when a neuron's pre-activation is negative for essentially every input. Its output and its gradient are then both zero, so gradient descent no longer updates its weights and the neuron can remain inactive for the rest of training. Variants such as Leaky ReLU, Parametric ReLU (PReLU), and ELU were developed to address this by allowing a small, non-zero gradient for negative inputs.
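Both the sparsity and the dying-ReLU behavior can be sketched in a few lines of plain Python. The sample pre-activations, the single neuron with weight 0.5 and bias -10.0, and the Leaky ReLU slope alpha = 0.01 are hypothetical values chosen for illustration.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Derivative of ReLU (0 chosen at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for z > 0, a small slope `alpha` otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    return 1.0 if z > 0 else alpha


# Sparsity: fraction of units whose output is exactly zero.
pre_activations = [-1.2, 0.3, -0.7, 2.1, -0.1, 0.0, 1.5, -3.0]
activations = [relu(z) for z in pre_activations]
sparsity = sum(1 for a in activations if a == 0.0) / len(activations)
print(sparsity)  # 0.625

# A "dead" neuron: its large negative bias keeps the pre-activation
# below zero for every input in this sample.
weight, bias = 0.5, -10.0
inputs = [-1.0, 0.0, 1.0, 2.0, 3.0]
neuron_pre = [weight * x + bias for x in inputs]

relu_grads = [relu_grad(z) for z in neuron_pre]
leaky_grads = [leaky_relu_grad(z) for z in neuron_pre]
print(relu_grads)   # [0.0, 0.0, 0.0, 0.0, 0.0]
print(leaky_grads)  # [0.01, 0.01, 0.01, 0.01, 0.01]
```

With plain ReLU the neuron receives zero gradient on every sample, so nothing can move its weights back into the active region. The leaky variant keeps a small gradient flowing, which gives training a chance to recover the unit.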
ReLU became the default activation function for deep neural networks following influential work around 2010–2011, most notably the 2011 paper "Deep Sparse Rectifier Neural Networks" by Glorot, Bordes, and Bengio, which demonstrated its advantages empirically. Today it remains one of the most widely used activation functions in convolutional and feedforward networks. The original Transformer also used it in its feed-forward sublayers, although many later transformer models use smoother relatives such as GELU. It remains a foundational building block of modern deep learning.