A gating mechanism that selectively controls information flow through neural network layers.
A Gated Linear Unit (GLU) is a neural network building block that uses a learned gating mechanism to regulate which information passes through a layer. In the usual formulation, a linear layer first projects the input to twice the desired output width, and the result is split into two equal halves along the feature dimension: one half serves as the candidate output, while the other is passed through a sigmoid activation to produce gate values between 0 and 1. The two halves are then combined via element-wise multiplication, so the sigmoid-activated half acts as a soft gate that amplifies or suppresses individual features of the candidate half. Equivalently, in the original paper's notation, the two halves are two separate linear transformations of the same input: h(X) = (XW + b) ⊗ σ(XV + c). This selective filtering lets the network dynamically emphasize relevant signals and suppress noise, without the sequential computation that makes recurrent architectures expensive.
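As a concrete illustration, here is a minimal PyTorch sketch of this project-split-and-gate operation; the class name and dimensions are illustrative, not taken from the original paper:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Minimal Gated Linear Unit: project to 2x width, gate one half with the other."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # One projection produces both the candidate and the gate halves.
        self.proj = nn.Linear(d_in, 2 * d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        candidate, gate = self.proj(x).chunk(2, dim=-1)  # split along features
        return candidate * torch.sigmoid(gate)           # element-wise soft gating

x = torch.randn(4, 16)           # a batch of 4 vectors with 16 features
layer = GLU(d_in=16, d_out=32)
print(layer(x).shape)            # torch.Size([4, 32])
```

PyTorch also ships the split-and-gate step directly as torch.nn.functional.glu, which expects a tensor whose gated dimension has already been projected to twice the target width.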
GLUs were introduced in the 2017 paper "Language Modeling with Gated Convolutional Networks" by Yann Dauphin and colleagues, who demonstrated that convolutional networks equipped with gating could match or outperform recurrent models on large-scale language modeling benchmarks such as WikiText-103 and the Google Billion Word corpus. A key practical advantage was that, unlike LSTMs or GRUs, GLU-based convolutional networks have no sequential dependency between positions, so training can be parallelized across an entire sequence, making them significantly faster. The gating mechanism also helps mitigate the vanishing gradient problem in deep networks, since gradients can flow through the linear pathway during backpropagation without passing through a squashing nonlinearity.
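To make the gradient argument concrete, consider the simplified gating form analyzed in the original paper, where the same input feeds both the candidate and the gate. Differentiating with the product rule gives

\[
\nabla\big[X \otimes \sigma(X)\big] = \nabla X \otimes \sigma(X) + X \otimes \sigma'(X)\,\nabla X ,
\]

and the first term passes the gradient scaled only by the gate's current activation σ(X), with no saturating derivative factor. A tanh-sigmoid gate of the kind used in LSTMs, by contrast, multiplies every gradient path by derivative terms that shrink toward zero as the units saturate.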
Since their introduction, GLUs and their variants have become widely adopted across modern deep learning architectures. The SwiGLU variant, which replaces the sigmoid gate with the Swish (SiLU) activation function, was among the variants studied in Noam Shazeer's 2020 paper "GLU Variants Improve Transformer" and has been incorporated into large language models such as LLaMA and PaLM, where it improves quality over standard feed-forward layers at a comparable parameter budget. These variants follow the same structural logic and differ only in the nonlinearity applied to the gate.
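As a sketch of how this looks in practice, below is a LLaMA-style feed-forward block using SwiGLU in the common three-projection form; the names w_gate, w_up, and w_down are illustrative rather than taken from any particular codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Transformer feed-forward block with SwiGLU gating:
    computes W_down(silu(W_gate x) * W_up x)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # candidate branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish/SiLU replaces the sigmoid gate of the original GLU.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

Because this design uses three weight matrices rather than the usual two, models such as LLaMA shrink the hidden width (roughly by a factor of 2/3) to keep the layer's parameter count comparable to a standard feed-forward layer.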
GLUs matter because they offer a simple, effective way to make a layer's computation input-dependent. Rather than applying the same transformation to all features uniformly, a GLU layer learns to scale each feature based on context, improving both representational capacity and training stability. Built from nothing more than matrix multiplications and element-wise operations, they map well onto modern hardware, and their demonstrated gains in large-scale language modeling have made them a standard component in state-of-the-art model designs.