Unwanted variation in data or signals that degrades machine learning model performance.
In machine learning, noise refers to any component of data that obscures the true underlying signal a model is trying to learn. It can arise from measurement errors, data entry mistakes, sensor imprecision, irrelevant features, or natural stochasticity in the phenomenon being modeled. Because supervised learning algorithms attempt to find patterns that generalize beyond the training set, noise poses a fundamental challenge: a model that learns noise as if it were signal will perform well on training data but fail on new examples, a problem known as overfitting. Distinguishing meaningful signal from noise is therefore central to building models that generalize reliably.
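The training-versus-generalization gap described above can be made concrete with a minimal NumPy sketch. The data, the linear "true" signal, and the polynomial degrees are all illustrative assumptions: a flexible model drives training error down by fitting the noise, but that gain does not transfer to held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: the true relationship is linear; observations add Gaussian noise.
def signal(x):
    return 2.0 * x + 1.0

x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.025, 0.975, 20)   # held-out points between the training ones
y_train = signal(x_train) + rng.normal(0.0, 0.3, size=x_train.shape)
y_test = signal(x_test) + rng.normal(0.0, 0.3, size=x_test.shape)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    def mse(x, y):
        return float(np.mean((np.polyval(coefs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

simple_train, simple_test = fit_and_score(1)    # matches the true model class
complex_train, complex_test = fit_and_score(9)  # flexible enough to fit noise

# The flexible model fits the training noise, lowering training error,
# but its held-out error stays well above its training error.
assert complex_train < simple_train
assert complex_test > complex_train
```

Run with different seeds, the particular error values change, but the qualitative pattern (low training error, inflated test error for the over-flexible model) is the overfitting behavior the paragraph describes.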
Noise affects machine learning systems at multiple levels. At the data level, label noise—where training examples are assigned incorrect class labels—can systematically mislead a classifier, while feature noise corrupts input values, introducing spurious correlations that a flexible model may mistake for real structure. At the optimization level, stochastic gradient descent deliberately introduces noise by computing gradients on random mini-batches rather than the full dataset; this controlled randomness helps models escape sharp local minima and find flatter, more generalizable solutions. Regularization techniques such as dropout and weight decay can also be understood as mechanisms for making models robust to noise by discouraging over-reliance on any single feature or parameter.
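The mini-batch gradient noise at the optimization level can be sketched directly. This assumes a linear least-squares loss and synthetic data (all illustrative choices): any single mini-batch gradient is a noisy estimate of the full-batch gradient, but it is unbiased, so averaging many of them recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear-regression problem: loss is mean squared error over n rows.
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(0.0, 0.1, size=n)

def grad(w, idx):
    """Gradient of the mean squared error restricted to the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
full = grad(w, np.arange(n))  # full-batch gradient

# Each mini-batch gradient is a noisy but unbiased estimate of `full`.
batches = [grad(w, rng.choice(n, size=32, replace=False)) for _ in range(2000)]
mean_mb = np.mean(batches, axis=0)

single_err = np.linalg.norm(batches[0] - full)  # one mini-batch: noisy
mean_err = np.linalg.norm(mean_mb - full)       # averaged: close to full gradient

assert np.allclose(mean_mb, full, atol=0.1)  # unbiased on average
assert mean_err < single_err                 # averaging shrinks the noise
```

In actual SGD the noisy per-batch gradient is used directly at each step; the perturbation it injects into the optimization trajectory is the "controlled randomness" credited with escaping sharp minima.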
Counterintuitively, noise is sometimes injected deliberately to improve model robustness. Data augmentation adds controlled perturbations—random crops, flips, or color jitter in image tasks—to artificially expand training diversity and reduce sensitivity to irrelevant variation. Denoising autoencoders are trained explicitly to reconstruct clean inputs from corrupted versions, forcing the network to learn robust internal representations. In diffusion models, the entire generative process is framed as learning to reverse a gradual noise-addition process, making noise not a nuisance but the core mechanism of generation. Understanding and managing noise thus remains one of the most practically consequential skills in applied machine learning.
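The forward (noise-adding) half of a diffusion process has a simple closed form that can be sketched in a few lines. The linear beta schedule and the horizon `T` below are illustrative assumptions, not any particular model's defaults: at each step a little signal is scaled away and a little Gaussian noise is added, so early samples retain most of the signal while the final step is nearly pure noise.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative linear variance schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(10_000)       # a constant "clean" signal
early = q_sample(x0, 10)   # early step: mostly signal
late = q_sample(x0, T - 1) # final step: essentially pure Gaussian noise

assert abs(early.mean() - 1.0) < 0.1  # signal still dominates
assert abs(late.mean()) < 0.1         # signal has been noised away
```

A generative diffusion model is then trained to invert this process step by step, typically by predicting the noise `eps` that was added; the sketch above covers only the fixed forward corruption.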