An intermediate neural network layer that learns internal representations of data.
A hidden layer is any layer in a neural network that sits between the input layer and the output layer; its activations are never observed directly, appearing neither in the raw data nor in the final predictions. Each neuron in a hidden layer receives signals from the previous layer, multiplies them by learned weights, adds a bias term, and passes the result through a nonlinear activation function such as ReLU or sigmoid. This transformation allows the network to encode increasingly abstract features of the input, such as edges and textures in early vision layers and semantic concepts in deeper ones, rather than relying on hand-crafted feature engineering.
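As a rough sketch of that computation, the forward pass of a single hidden layer can be written in a few lines of NumPy. The function name hidden_layer and the layer sizes below are illustrative choices, not part of any particular library:

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity: max(0, z)
    return np.maximum(0.0, z)

def hidden_layer(x, W, b):
    # One hidden layer: weighted sum of the previous layer's signals,
    # plus a bias, passed through a nonlinear activation.
    return relu(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # activations from the previous layer (4 units)
W = rng.normal(size=(3, 4))   # learned weights (random here, for illustration)
b = np.zeros(3)               # learned biases
h = hidden_layer(x, W, b)     # hidden representation of x, shape (3,)
```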
The power of hidden layers lies in their collective ability to approximate arbitrarily complex functions. By stacking multiple hidden layers, a network builds a hierarchy of representations where each layer refines the abstractions produced by the one before it. The universal approximation theorem formalizes this intuition, showing that even a single sufficiently wide hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, although in practice depth tends to be far more parameter-efficient than width for learning structured data.
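Stacking layers is then just repeated application of the same transformation, with each layer consuming the representation produced by the one before it. A minimal sketch, again with illustrative shapes and randomly initialized weights:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    # Each hidden layer refines the representation produced by the one before it.
    h = x
    for W, b in layers:
        h = relu(W @ h + b)
    return h

rng = np.random.default_rng(0)
# Three stacked hidden layers: 8 -> 16 -> 16 -> 4
dims = [8, 16, 16, 4]
layers = [(rng.normal(size=(d_out, d_in)), np.zeros(d_out))
          for d_in, d_out in zip(dims[:-1], dims[1:])]
h = forward(rng.normal(size=8), layers)   # final hidden representation, shape (4,)
```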
The number, width, and connectivity pattern of hidden layers are among the most consequential architectural decisions in model design. Shallow networks with one or two hidden layers work well for tabular data and simpler tasks, while deep networks with dozens or hundreds of layers — as in ResNets or Transformers — are necessary for high-dimensional problems like image classification, speech recognition, and language modeling. Techniques like batch normalization, residual connections, and dropout were developed largely to make training deep stacks of hidden layers stable and effective.
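One way to see how depth and width become explicit design parameters is a small builder function written in the style of PyTorch, assumed here purely as an example framework; make_mlp, hidden_dims, and the chosen dropout rate are hypothetical names and values rather than a standard API:

```python
import torch.nn as nn

def make_mlp(in_dim, hidden_dims, out_dim, p_drop=0.1):
    # Depth and width of the hidden stack are controlled by hidden_dims.
    layers, prev = [], in_dim
    for width in hidden_dims:
        layers += [nn.Linear(prev, width),
                   nn.BatchNorm1d(width),   # helps stabilize training of deeper stacks
                   nn.ReLU(),
                   nn.Dropout(p_drop)]      # regularizes wide hidden layers
        prev = width
    layers.append(nn.Linear(prev, out_dim)) # output layer, no hidden activation
    return nn.Sequential(*layers)

shallow = make_mlp(32, [64], 10)                 # one hidden layer, e.g. tabular data
deep = make_mlp(32, [256, 256, 256, 256], 10)    # a deeper stack for harder problems
```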
Hidden layers became practically significant with the popularization of backpropagation in 1986, which provided an efficient algorithm for computing gradients through multiple layers and updating weights accordingly. Before this, there was no widely adopted, practical method for training networks with more than one hidden layer. The subsequent decades of research into activation functions, initialization schemes, and optimization algorithms have largely been aimed at unlocking the representational capacity that hidden layers provide.
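A minimal sketch of what backpropagation computes for a single hidden layer is shown below, using a squared-error loss and plain gradient descent as illustrative assumptions; the chain rule carries the output error back through the hidden layer to every weight:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=2)    # toy input and target
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)    # hidden layer parameters
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)    # output layer parameters

# Forward pass through one hidden layer
z1 = W1 @ x + b1
h = np.maximum(0.0, z1)                 # ReLU hidden activations
y_hat = W2 @ h + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)   # squared-error loss (illustrative choice)

# Backward pass: propagate the output error back through the hidden layer
d_yhat = y_hat - y              # dL/dy_hat
dW2 = np.outer(d_yhat, h)       # gradient for output-layer weights
db2 = d_yhat
dh = W2.T @ d_yhat              # gradient flowing into the hidden layer
dz1 = dh * (z1 > 0)             # ReLU gate: only active units pass gradient
dW1 = np.outer(dz1, x)          # gradient for hidden-layer weights
db1 = dz1

# Gradient-descent update on all parameters
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```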