Envisioning is an emerging technology research institute and advisory.


Quantization

Reducing numerical precision of model weights and activations to shrink size and accelerate inference.

Year: 2015 · Generality: 794

Quantization is a model compression technique that reduces the numerical precision used to represent a neural network's weights and activations — typically from 32-bit floating point down to 8-bit integers, or even lower. This reduction directly shrinks the memory footprint of a model and allows hardware to perform faster arithmetic operations, since integer math is significantly cheaper than floating-point math on most processors. The result is a model that runs faster and consumes less power, making deployment feasible on resource-constrained devices like smartphones, microcontrollers, and edge accelerators.
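The core arithmetic can be sketched as an affine mapping between a float range and an integer grid: pick a scale and zero-point from the observed min/max, round each value onto the grid, and clamp. A minimal illustration in plain Python (the helper names are ours, not from any framework):

```python
def quantize(values, num_bits=8):
    """Affine (asymmetric) quantization: map floats onto an unsigned integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard against constant input
    zero_point = round(qmin - lo / scale)      # integer that represents float 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(qi - zero_point) * scale for qi in q]
```

Round-tripping a tensor through `quantize` and `dequantize` bounds the per-element error by roughly one quantization step (`scale`), which is why 8-bit grids are usually accurate enough in practice.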

There are two primary approaches to quantization. Post-training quantization (PTQ) applies precision reduction after a model has been fully trained, requiring only a small calibration dataset to estimate the range of activation values. It is fast and convenient but can introduce measurable accuracy degradation, especially at very low bit-widths. Quantization-aware training (QAT) takes a different approach by simulating quantization effects during the training process itself, allowing the model to adapt its weights to the constraints of reduced precision. QAT generally preserves accuracy more effectively than PTQ, at the cost of additional training time and complexity.
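The calibration step PTQ depends on can be sketched as follows. This illustrative code (a percentile heuristic, one of several common range-estimation choices; the function names are ours) picks a clipping threshold from activations observed on a calibration set, then quantizes symmetrically against it:

```python
def calibrate_range(calibration_activations, percentile=0.999):
    """PTQ calibration sketch: choose a clipping threshold from observed
    activation magnitudes, discarding extreme outliers that would otherwise
    stretch the quantization grid and waste precision."""
    flat = sorted(abs(a) for batch in calibration_activations for a in batch)
    k = min(len(flat) - 1, int(percentile * len(flat)))
    return flat[k]  # symmetric clipping threshold

def quantize_symmetric(values, threshold, num_bits=8):
    """Symmetric quantization using the calibrated range; values beyond the
    threshold are clipped to the integer extremes."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    scale = threshold / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values], scale
```

Clipping a rare outlier costs accuracy on that one value but tightens the grid for everything else, which is the basic trade-off PTQ calibration tunes. QAT sidesteps part of this problem by letting training adapt the weights to whatever grid is chosen.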

Quantization matters because the gap between the scale of modern neural networks and the compute budgets of real-world deployment targets is enormous. A large language model or vision transformer trained in full precision on data center hardware must often be compressed aggressively before it can run efficiently on consumer devices or serve low-latency production traffic. Quantization is one of the most practical tools for closing that gap, often achieving 2–4× speedups and a 4× memory reduction (going from 32-bit floats to 8-bit integers) with minimal accuracy loss when applied carefully.

The technique has become a standard part of the ML deployment pipeline, supported natively by frameworks like TensorFlow Lite, PyTorch, and ONNX Runtime. Research continues to push toward lower bit-widths — 4-bit and even binary quantization — and to develop methods that are more robust across diverse model architectures, including large language models where quantization sensitivity varies significantly across layers.

Related

TurboQuant

A high-speed quantization framework for compressing neural networks with minimal accuracy loss.

Generality: 94
Low-Bit Palletization

Reducing numerical precision of model weights to cut memory use and speed inference.

Generality: 485
BQL (Binary Quantization Learning)

A technique that compresses neural networks by reducing weights and activations to binary values.

Generality: 174
LAQ (Locally-Adaptive Quantization)

Quantization method that adjusts precision locally based on data characteristics for better efficiency.

Generality: 101
Model Compression

Techniques that shrink machine learning models while preserving predictive accuracy.

Generality: 795
Inference Acceleration

Techniques and hardware that speed up neural network prediction without sacrificing accuracy.

Generality: 694