Using a trained model to generate predictions or decisions on new, unseen data.
Inference is the deployment phase of a machine learning model, where a system trained on historical data is applied to new, unseen inputs to produce predictions, classifications, or decisions. Unlike training—which iteratively adjusts model parameters through backpropagation and gradient descent—inference is a forward-only pass through the network, making it computationally lighter but still demanding at scale. The model's weights are frozen during inference, meaning it draws entirely on patterns learned during training to interpret novel inputs.
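A minimal sketch of that forward-only pass, assuming a PyTorch setup; the tiny classifier and the commented-out checkpoint path are placeholders for whatever trained model is actually being served:

```python
import torch
import torch.nn as nn

# Hypothetical classifier standing in for any trained model.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)
# In practice the trained weights would be loaded first, e.g.:
# model.load_state_dict(torch.load("classifier.pt"))

model.eval()                   # switch dropout/batch-norm layers to inference behavior
with torch.no_grad():          # forward-only: no gradients tracked, weights stay frozen
    x = torch.randn(1, 16)     # one new, unseen input
    logits = model(x)
    prediction = logits.argmax(dim=-1)

print(prediction.item())
```

The `torch.no_grad()` context is what distinguishes this from a training step: no computation graph is built and no parameters are updated, so the same pass is cheaper in both compute and memory.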
The mechanics of inference vary significantly depending on the deployment context. In batch inference, large volumes of data are processed offline in groups, which is common in recommendation systems or periodic analytics pipelines. Real-time inference, by contrast, requires low-latency responses to individual inputs and is essential for applications like autonomous driving, voice assistants, and fraud detection. Optimizing inference speed and memory footprint has become a major engineering discipline, giving rise to techniques such as model quantization, pruning, knowledge distillation, and hardware-specific compilation via tools like TensorRT or ONNX Runtime.
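As an illustrative sketch of one such optimization, the snippet below applies PyTorch's post-training dynamic quantization to a hypothetical model and then runs it in both batch and single-request modes; the layer sizes and batch shapes are made-up stand-ins, not a recommended configuration:

```python
import torch
import torch.nn as nn

# Hypothetical trained model; in practice loaded from a checkpoint.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: weights of the listed layer types are
# stored as int8 and dequantized on the fly, shrinking the memory footprint
# and often speeding up CPU inference with little accuracy loss.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    batch = torch.randn(64, 128)      # batch inference: many inputs processed together
    batch_preds = quantized(batch).argmax(dim=-1)

    single = torch.randn(1, 128)      # real-time inference: one low-latency request
    single_pred = quantized(single).argmax(dim=-1)
```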
Inference efficiency matters enormously in production environments because it directly affects cost, latency, and accessibility. Running large language models or vision transformers at scale can consume substantial compute resources, driving demand for specialized inference hardware such as GPUs, TPUs, and edge accelerators. The gap between a model that performs well in research and one that can be served reliably in production is often bridged by inference optimization work. As AI systems are embedded into more latency-sensitive and resource-constrained environments—from smartphones to IoT sensors—the ability to perform fast, accurate inference has become as strategically important as model accuracy itself.
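As a rough sketch of how that latency cost might be quantified, the helper below times repeated single-input requests on CPU; the warmup and run counts are arbitrary, and real benchmarking would also account for batching, device synchronization, and tail latencies:

```python
import time
import torch

def measure_latency(model, example_input, warmup=10, runs=100):
    """Average per-request latency in milliseconds (illustrative only)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):        # warm caches and lazy initialization
            model(example_input)
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1000
```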