A transformer sublayer that applies the same two-layer nonlinear transformation independently to each sequence position.
A point-wise feedforward network (FFN) is a neural network sublayer that applies the same two-layer fully connected transformation independently to every position in a sequence. Unlike attention mechanisms, which mix information across positions, the FFN treats each position in isolation (hence "point-wise"), making it embarrassingly parallel and computationally efficient. The standard formulation applies a linear projection that expands the input dimensionality by a factor (typically four), passes the result through a non-linear activation function such as ReLU or GELU, then projects back down to the original model dimension.
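A minimal sketch of this formulation, assuming PyTorch; the class and parameter names (PointwiseFFN, d_model, d_ff) are illustrative rather than from any particular codebase:

```python
import torch
import torch.nn as nn

class PointwiseFFN(nn.Module):
    """Illustrative point-wise FFN: expand, activate, contract."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.expand = nn.Linear(d_model, d_ff)    # up-projection (typically 4x)
        self.contract = nn.Linear(d_ff, d_model)  # back to the model dimension
        self.act = nn.ReLU()                      # GELU is also common

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). nn.Linear acts on the last
        # dimension, so the same weights are applied at every position.
        return self.contract(self.act(self.expand(x)))
```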
In the transformer architecture introduced by Vaswani et al. in 2017, each encoder and decoder block contains one point-wise FFN sublayer alongside the multi-head self-attention sublayer. The expansion-then-contraction structure (sometimes described as an inverted bottleneck) gives the network the capacity to learn rich intermediate representations at each position before compressing them back into the residual stream. Because the same weight matrices are shared across all positions within a layer, the FFN functions somewhat like a learned, position-agnostic feature extractor applied uniformly across the sequence.
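The position independence can be checked directly. In this self-contained sketch (dimensions are arbitrary choices for illustration), running the sublayer over a whole sequence matches running it on each position separately:

```python
import torch
import torch.nn as nn

# The FFN output at position t depends only on the input at position t.
torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
x = torch.randn(2, 10, 512)                    # (batch, seq_len, d_model)
whole = ffn(x)                                 # all positions at once
per_pos = torch.stack([ffn(x[:, t]) for t in range(10)], dim=1)
print(torch.allclose(whole, per_pos, atol=1e-6))  # True: no cross-position mixing
```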
The FFN sublayer plays a critical role in the overall expressiveness of transformer models. Research has shown that these layers store a surprising amount of factual and relational knowledge, effectively acting as key-value memories: the rows of the first linear layer act as keys that match input patterns, and the second linear layer mixes the corresponding value vectors back into the representation. This insight has motivated architectural variants such as mixture-of-experts (MoE) layers, which replace the dense FFN with a sparse collection of expert networks, allowing models to scale parameters without proportionally increasing compute.
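To make the MoE idea concrete, here is a hedged sketch of top-1 routing over a few expert FFNs; production MoE layers (e.g., Switch Transformer) add load-balancing losses and expert-capacity limits that are omitted here, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class Top1MoEFFN(nn.Module):
    """Illustrative sparse FFN: each token is routed to a single expert."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Parameters grow with n_experts,
        # but each token only pays for one expert's compute.
        probs = self.router(x).softmax(dim=-1)    # (batch, seq, n_experts)
        best = probs.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i                      # tokens routed to expert i
            if mask.any():
                # scale by the gate value so the router receives gradients
                out[mask] = expert(x[mask]) * probs[..., i][mask].unsqueeze(-1)
        return out
```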
Point-wise feedforward networks matter because they account for a large fraction of a transformer's total parameters and computational cost (roughly two-thirds of a block's non-embedding parameters under the standard 4x expansion), making them a primary target for efficiency research. Techniques like low-rank approximation, pruning, and quantization are frequently applied to FFN weights. Their design also influences how models are scaled: the ratio of FFN hidden dimension to model dimension is a key hyperparameter that practitioners tune when balancing capacity against memory and throughput constraints.
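A back-of-envelope calculation makes that parameter share concrete. The sizes below (d_model = 768, d_ff = 3072, the standard 4x ratio, roughly GPT-2-small-like) are illustrative assumptions, and biases and embeddings are ignored:

```python
# Per-layer weight counts for one transformer block.
d_model, d_ff = 768, 3072
ffn_params = 2 * d_model * d_ff          # two linear maps: expand + contract
attn_params = 4 * d_model * d_model      # Q, K, V, and output projections
total = ffn_params + attn_params
print(f"FFN:       {ffn_params:,}")              # 4,718,592
print(f"Attention: {attn_params:,}")             # 2,359,296
print(f"FFN share: {ffn_params / total:.0%}")    # 67%
```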