Reducing the numerical precision of model weights and activations to shrink model size and accelerate inference.
Quantization is a model compression technique that reduces the numerical precision used to represent a neural network's weights and activations — typically from 32-bit floating point down to 8-bit integers, or even lower. This reduction directly shrinks the memory footprint of a model and allows hardware to perform faster arithmetic operations, since integer math is significantly cheaper than floating-point math on most processors. The result is a model that runs faster and consumes less power, making deployment feasible on resource-constrained devices like smartphones, microcontrollers, and edge accelerators.
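To make the mapping concrete, the sketch below shows one common scheme, per-tensor affine quantization to int8, using NumPy; the function names and the 256×256 weight matrix are purely illustrative, and real frameworks handle details like per-channel scales and saturation differently.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of a float32 tensor to int8."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / 255.0              # map the float range onto 256 int8 levels
    zero_point = np.round(-x_min / scale) - 128  # int8 value that represents float 0.0
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Recover an approximate float32 tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_int8(weights)
recon = dequantize_int8(q, scale, zp)

print(f"memory: {weights.nbytes} bytes -> {q.nbytes} bytes (4x smaller)")
print(f"max abs error: {np.abs(weights - recon).max():.5f}")
```

The 4× memory saving follows directly from storing one byte per value instead of four; the reconstruction error printed at the end is the rounding noise that quantization trades for that saving.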
There are two primary approaches to quantization. Post-training quantization (PTQ) applies precision reduction after a model has been fully trained, requiring only a small calibration dataset to estimate the range of activation values. It is fast and convenient but can introduce measurable accuracy degradation, especially at very low bit-widths. Quantization-aware training (QAT) takes a different approach by simulating quantization effects during the training process itself, allowing the model to adapt its weights to the constraints of reduced precision. QAT generally preserves accuracy more effectively than PTQ, at the cost of additional training time and complexity.
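The core mechanism behind QAT is "fake quantization": the forward pass rounds values exactly as the deployed integer model would, while gradients still flow to full-precision weights. A minimal PyTorch sketch of that step is shown below, assuming a simple per-tensor affine scheme; the function name and bit-width handling are illustrative rather than any framework's actual API.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate quantization in the forward pass while keeping float gradients.

    round() is non-differentiable, so the straight-through estimator
    (x + (dequant - x).detach()) passes gradients through unchanged,
    which is the usual trick QAT relies on.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    dequant = (q - zero_point) * scale
    return x + (dequant - x).detach()  # straight-through estimator

# During QAT, weights and activations pass through fake_quantize in forward();
# the optimizer still updates the underlying full-precision weights.
w = torch.randn(64, 64, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()                        # gradients reach w despite the rounding
print(w.grad.shape)
```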
Quantization matters because the gap between the scale of modern neural networks and the compute budgets of real-world deployment targets is enormous. A large language model or vision transformer trained in full precision on data center hardware must often be compressed aggressively before it can run efficiently on consumer devices or serve low-latency production traffic. Quantization is one of the most practical tools for closing that gap, often achieving 2–4× speedups and equivalent memory reductions with minimal accuracy loss when applied carefully.
The technique has become a standard part of the ML deployment pipeline, supported natively by frameworks like TensorFlow Lite, PyTorch, and ONNX Runtime. Research continues to push toward lower bit-widths — 4-bit and even binary quantization — and to develop methods that are more robust across diverse model architectures, including large language models where quantization sensitivity varies significantly across layers.
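As one concrete example of that framework support, PyTorch exposes post-training dynamic quantization through a single call; the toy model and shapes below are illustrative, and the exact module path may differ across PyTorch versions.

```python
import torch
import torch.nn as nn

# A small model standing in for a trained network.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Post-training dynamic quantization: Linear weights are stored as int8,
# and activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller and faster on CPU
```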