A feedforward neural network with a single hidden layer can approximate any continuous function on a compact domain arbitrarily well.
The universal approximation theorem is a foundational result in neural network theory: a feedforward network with a single hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary precision. More formally, given any continuous target function and any error tolerance ε > 0, there exist network weights such that the maximum deviation between the network's output and the target function is less than ε. The result holds for a broad class of activation functions: it was originally proved for sigmoidal activations by Cybenko and by Hornik, Stinchcombe, and White in 1989, and later extended to ReLU and virtually any nonpolynomial activation function.
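In symbols, a standard sup-norm statement of the theorem for a single output reads as follows (the compact set K, the activation σ, and the notation below are one common convention, not a quotation from the original papers):

```latex
% Sup-norm form of the universal approximation theorem.
% K \subset \mathbb{R}^n is compact, f is the continuous target,
% \sigma is the activation, and N is the hidden-layer width.
\forall f \in C(K),\ \forall \varepsilon > 0,\ \exists N \in \mathbb{N},\
\exists \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ (i = 1, \dots, N):
\quad
\sup_{x \in K} \Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon .
```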
The theorem is an existence result, not a constructive one. It guarantees that a sufficiently wide shallow network has the representational capacity to express a given function, but says nothing about how many neurons are actually needed, whether gradient-based training will find the right weights, or how well the learned function generalizes to unseen data. In practice, the number of neurons required for a shallow network to approximate complex functions can be exponentially large, which is one reason deep architectures are preferred — depth provides exponential gains in parameter efficiency for many function classes.
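A minimal sketch of this gap between capacity and construction, under assumptions of my own choosing (a random-feature fit with a tanh hidden layer, an arbitrary 1-D target, and arbitrary widths), is the following. It fixes random hidden weights, solves a least-squares problem for the output weights, and reports the worst-case error on a grid as the hidden layer widens; this is an illustration, not the procedure the theorem describes.

```python
# Illustrative sketch: one hidden layer, random fixed hidden weights,
# output weights fit by least squares. Watch the max grid error shrink
# as the width grows. (This does not reproduce the theorem's proof.)
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # An arbitrary continuous target function on [-1, 1].
    return np.sin(3 * x) + 0.5 * np.cos(7 * x)

def fit_shallow_net(x, y, width):
    # Hidden layer: tanh(w * x + b) with randomly drawn, fixed w and b.
    w = rng.normal(scale=5.0, size=width)
    b = rng.uniform(-3.0, 3.0, size=width)
    H = np.tanh(np.outer(x, w) + b)                 # (n_points, width) features
    alpha, *_ = np.linalg.lstsq(H, y, rcond=None)   # output-layer weights
    return lambda x_new: np.tanh(np.outer(x_new, w) + b) @ alpha

x = np.linspace(-1.0, 1.0, 400)
y = target(x)

for width in (4, 16, 64, 256):
    net = fit_shallow_net(x, y, width)
    max_err = np.max(np.abs(net(x) - y))
    print(f"width={width:4d}  max |error| on grid = {max_err:.4f}")
```

In runs like this the worst-case error typically falls as the width grows, but the theorem itself guarantees only that good weights exist at some width; it does not promise that this particular fitting procedure, or gradient descent, will find them.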
Modern extensions of the theorem have significantly enriched its practical relevance. Researchers have studied width-depth tradeoffs, showing that deeper networks can represent certain functions far more compactly than shallow ones. Work by Telgarsky, Mhaskar, Poggio, Hanin, and others has quantified approximation rates, identified function classes where depth provably helps, and established minimum width requirements for universality with specific activations like ReLU. These results help explain empirically observed advantages of deep learning architectures.
For practitioners and theorists alike, the universal approximation theorem serves as a conceptual anchor: it establishes that neural networks are not fundamentally limited in what they can represent, shifting the key questions to optimization, generalization, and architectural efficiency. It remains one of the most cited theoretical justifications for using neural networks as general-purpose function approximators across domains ranging from computer vision to scientific simulation.