Techniques that shrink machine learning models while preserving predictive accuracy.
Model compression refers to a family of techniques that reduce the size, memory footprint, and computational cost of machine learning models without substantially degrading their accuracy. As deep learning architectures grew dramatically in scale throughout the 2010s, deploying state-of-the-art models on resource-constrained hardware (mobile phones, embedded systems, IoT sensors, and other edge devices) became a critical engineering challenge. Model compression emerged as a primary response, allowing capable models to run efficiently outside of data centers.
The field encompasses several distinct but complementary approaches. Pruning removes redundant or low-importance weights and neurons from a trained network, often with minimal accuracy loss. Quantization reduces the numerical precision of weights and activations. Converting 32-bit floating-point values to 8-bit integers, for example, cuts weight storage roughly fourfold and accelerates inference on hardware with integer arithmetic support. Knowledge distillation, popularized by Geoffrey Hinton and colleagues in 2015, trains a smaller "student" model to mimic the softened output distributions of a larger "teacher" model, which often yields a more accurate student than training the same small model from scratch on the labels alone. Low-rank factorization approximates a large m × n weight matrix by the product of an m × r and an r × n matrix with r much smaller than m and n, reducing the parameter count from mn to r(m + n) while preserving most of the layer's expressive capacity.
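The four approaches above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production recipe: the function names are hypothetical, magnitude-based pruning is only one of several importance criteria, the quantizer assumes a symmetric per-tensor int8 scheme, the factorization uses a truncated SVD, and the distillation loss assumes the common temperature-scaled KL formulation with an arbitrarily chosen default temperature.

```python
import numpy as np


def magnitude_prune(w, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # The k-th smallest absolute value acts as the pruning threshold.
    # Ties at the threshold are pruned too, so slightly more than k
    # entries can be zeroed.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w).astype(w.dtype)


def quantize_int8(w):
    """Symmetric per-tensor quantization (assumed scheme): float -> int8."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale


def low_rank_factorize(w, rank):
    """Approximate w (m x n) as a @ b, with a: m x rank and b: rank x n."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale each kept column by its singular value
    b = vt[:rank, :]
    return a, b


def _log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    log_p = _log_softmax(teacher_logits / temperature)
    log_q = _log_softmax(student_logits / temperature)
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(np.mean(kl) * temperature ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64)).astype(np.float32)
    pruned = magnitude_prune(w, sparsity=0.5)
    q, scale = quantize_int8(w)
    a, b = low_rank_factorize(w, rank=8)
    print("nonzero after pruning:", np.count_nonzero(pruned), "of", w.size)
    print("max quantization error:", float(np.abs(w - dequantize(q, scale)).max()))
    print("low-rank parameters:", a.size + b.size, "vs", w.size)
```

In a real workflow each step would normally be followed by fine-tuning or calibration on data, which this sketch omits.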
These techniques are frequently combined in practice. A large model might first be distilled into a compact architecture, which is then pruned and finally quantized, with each step compounding the efficiency gains. Neural architecture search (NAS) has also been applied to discover inherently efficient architectures, such as MobileNetV3 and EfficientNet, that achieve strong performance with far fewer parameters than earlier general-purpose designs of comparable accuracy.
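How the gains from pruning and quantization compound can be checked with simple storage arithmetic. The sketch below assumes a deliberately simplified, hypothetical storage model in which every surviving weight is stored as a value plus a fixed-width 16-bit index. Real sparse formats differ, so the numbers are illustrative only.

```python
def dense_bytes(n_params, bits_per_weight=32):
    """Storage for a dense tensor at the given precision."""
    return n_params * bits_per_weight / 8


def sparse_bytes(n_params, sparsity, value_bits=8, index_bits=16):
    """Storage when only surviving weights are kept.

    Assumed (hypothetical) format: one value plus one index per kept weight.
    """
    kept = round(n_params * (1.0 - sparsity))
    return kept * (value_bits + index_bits) / 8


if __name__ == "__main__":
    n = 1_000_000
    baseline = dense_bytes(n)                         # dense fp32
    quant_only = dense_bytes(n, bits_per_weight=8)    # dense int8
    prune_only = sparse_bytes(n, 0.9, value_bits=32)  # 90% sparse, fp32 values
    combined = sparse_bytes(n, 0.9, value_bits=8)     # 90% sparse, int8 values
    for name, size in [("quantize", quant_only),
                       ("prune", prune_only),
                       ("prune + quantize", combined)]:
        print(f"{name}: {baseline / size:.2f}x smaller")
```

Under these assumptions the combined ratio is larger than either technique achieves alone, but it is smaller than the product of the two individual ratios because the index overhead does not shrink with the value precision.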
Model compression matters beyond hardware constraints. Smaller models generally consume less energy per prediction, reducing the environmental footprint of AI inference at scale. They also tend to have lower latency, which is critical for real-time applications like speech recognition, autonomous driving, and augmented reality. As foundation models continue to grow into the billions and, in some cases, trillions of parameters, compression techniques have become essential for making modern AI practical, affordable, and broadly deployable across the full spectrum of computing environments.