A property of models and data representations in which most values are zero, exploited to improve computational and memory efficiency.
Sparsity refers to the property of a model or data representation in which the vast majority of values are zero or near-zero, with only a small fraction carrying meaningful information. In machine learning, this principle appears in two related but distinct contexts: sparse data, where input features are mostly absent or zero (as in text represented by word counts), and sparse models, where most parameters or activations are suppressed to zero. Both forms reduce the effective complexity of a system, enabling faster computation and lower memory consumption without necessarily sacrificing predictive power.
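To make the sparse-data case concrete, the following minimal sketch builds a toy bag-of-words matrix (the vocabulary and documents are invented for illustration) and stores it in SciPy's compressed sparse row (CSR) format, which records only the nonzero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy bag-of-words example: 3 documents over a 6-word vocabulary.
# Most entries are zero, so a sparse format stores only the nonzeros.
vocab = ["cat", "dog", "runs", "fast", "sleeps", "barks"]
dense = np.array([
    [2, 0, 1, 0, 0, 0],   # "cat cat runs"
    [0, 1, 0, 0, 0, 2],   # "dog barks barks"
    [0, 0, 0, 0, 1, 0],   # "sleeps"
])

sparse = csr_matrix(dense)
print(f"stored nonzeros: {sparse.nnz} of {dense.size} entries")
print("doc 0 nonzero words:", [vocab[j] for j in sparse[0].indices])
# CSR keeps three flat arrays (values, column indices, row pointers),
# so memory scales with the nonzero count rather than the matrix size.
```

Real text corpora are far more extreme than this toy case: a document typically contains a few hundred distinct words out of a vocabulary of tens of thousands, so well over 99% of entries are zero.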
Sparsity can be achieved through several mechanisms. In neural networks, pruning removes weights whose magnitudes fall below a significance threshold, leaving a leaner network that approximates the original. Sparse activations arise when activation functions like ReLU output zero for negative inputs, naturally silencing many neurons during a forward pass. Regularization techniques such as L1 (Lasso) penalize the absolute magnitude of weights, pushing many toward exactly zero during training. Mixture-of-experts architectures take this further by routing each input through only a small subset of specialized subnetworks, achieving massive model capacity with sparse computation per example.
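The NumPy sketch below illustrates the first three of these mechanisms on a small random weight matrix; the shapes, pruning threshold, and regularization strength are arbitrary choices for the example, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # dense weight matrix (illustrative size)
x = rng.normal(size=8)         # one input vector

# Magnitude pruning: zero out weights whose absolute value falls
# below a threshold (0.5 here is an arbitrary cutoff).
threshold = 0.5
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)
print("pruned weight sparsity:", np.mean(W_pruned == 0.0))

# Sparse activations: ReLU outputs zero for every negative
# pre-activation, silencing those units for this input.
activation = np.maximum(W_pruned @ x, 0.0)
print("activation sparsity:", np.mean(activation == 0.0))

# L1 (Lasso) regularization: the loss gains a term lam * sum(|w|),
# whose subgradient lam * sign(w) shrinks weights toward exactly zero.
lam = 0.01
l1_penalty = lam * np.sum(np.abs(W_pruned))
print("L1 penalty:", l1_penalty)
```

In practice, pruning is usually interleaved with further training so the remaining weights can compensate for those removed, and the L1 term is added to the task loss rather than computed in isolation as it is here.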
The practical importance of sparsity has grown alongside the scale of modern machine learning. As models expanded to billions of parameters, the cost of dense computation became prohibitive. Sparse methods allow practitioners to deploy capable models on constrained hardware—mobile devices, embedded systems, or edge servers—and to train larger architectures within fixed compute budgets. Sparse attention mechanisms in transformers, for instance, reduce the quadratic cost of attending over long sequences by restricting each token to a local or sampled subset of positions.
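A local attention mask makes the cost reduction easy to see. The sketch below uses an invented sequence length and window size, and only counts attended positions; production systems such as Longformer or BigBird combine windowed attention with global or random patterns and differ in detail.

```python
import numpy as np

seq_len, window = 12, 2  # illustrative sizes, not from any particular model

# Dense attention lets every token attend to every other: n^2 pairs.
# A local mask restricts each token to positions within `window` steps,
# so the number of attended pairs grows linearly with sequence length.
idx = np.arange(seq_len)
local_mask = np.abs(idx[:, None] - idx[None, :]) <= window

print("attended pairs (dense):", seq_len * seq_len)
print("attended pairs (local):", int(local_mask.sum()))
# In a transformer, scores outside the mask would be set to -inf
# before the softmax, giving masked positions zero attention weight.
```

Because each row of the mask has at most 2 * window + 1 true entries, the attended-pair count is roughly (2 * window + 1) * seq_len rather than seq_len squared, which is what makes long sequences tractable.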
Beyond efficiency, sparsity often improves interpretability and generalization. A model that relies on fewer active features is easier to inspect and less prone to overfitting noisy or irrelevant inputs. This dual benefit—computational and statistical—makes sparsity one of the most broadly applicable principles in machine learning, relevant to optimization, architecture design, compression, and theoretical analysis alike.