Computational resources consumed when an AI model runs inference on new data.
Evaluation-time compute refers to the computational resources expended when a trained AI model is actively used to generate predictions, classifications, or decisions on new inputs. This is distinct from training-time compute, which involves iterative parameter updates over large datasets. During inference, the model's weights are fixed, and the system performs a forward pass through the network to produce an output. The efficiency of this process determines how quickly and cheaply a model can respond in production environments.
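As an illustration, the minimal PyTorch sketch below shows what this fixed-weight forward pass looks like in practice. The toy two-layer classifier is hypothetical and stands in for any trained model; the key point is that no gradients are computed and no parameters change.

```python
import torch
import torch.nn as nn

# Hypothetical toy classifier standing in for a trained model with fixed weights.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()  # switch layers like dropout and batch norm to inference behavior

# Evaluation-time compute is this forward pass: no gradients, no parameter updates.
with torch.inference_mode():
    x = torch.randn(1, 128)             # a single new input
    logits = model(x)                   # one forward pass through fixed weights
    prediction = logits.argmax(dim=-1)

print(prediction.item())
```

Disabling gradient tracking matters because it avoids storing intermediate activations for backpropagation, which is a large part of what makes training-time compute more expensive per example.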
The mechanics of evaluation-time compute depend heavily on model architecture and deployment context. A large transformer model performing autoregressive text generation, for instance, must execute one forward pass per output token, each dependent on the previous one, which makes latency a central concern. Convolutional networks used in image classification can often be parallelized more aggressively, but they still face memory-bandwidth and arithmetic-throughput constraints. Techniques such as quantization, pruning, knowledge distillation, and operator fusion are commonly applied to reduce the computational footprint at inference time without substantially degrading accuracy.
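A sketch of why autoregressive generation is latency-bound follows. It assumes a model that maps a (batch, sequence) tensor of token IDs to per-position vocabulary logits; the function name and signature are illustrative, not taken from any particular library, and no key-value caching is shown.

```python
import torch

def greedy_generate(model, input_ids, max_new_tokens=32, eos_id=None):
    """Greedy decoding: one forward pass per generated token, strictly in sequence."""
    tokens = input_ids
    for _ in range(max_new_tokens):
        with torch.inference_mode():
            logits = model(tokens)                       # (batch, seq, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        # Stop check assumes batch size 1 for simplicity.
        if eos_id is not None and next_token.item() == eos_id:
            break
    return tokens
```

In production, key-value caching avoids recomputing attention over earlier positions, and quantization or distillation shrinks each forward pass itself, but the sequential dependence between tokens remains.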
The importance of evaluation-time compute has grown as AI models are deployed across an increasingly diverse range of hardware environments, from data center GPUs to mobile phones, embedded microcontrollers, and purpose-built inference accelerators. In latency-sensitive applications like autonomous driving, real-time translation, or fraud detection, the cost and speed of a single inference can be a hard engineering constraint. This has driven the development of specialized hardware, such as Google's inference-oriented TPUs, and software optimization stacks, such as NVIDIA's TensorRT, as well as model families explicitly designed for efficient deployment, including MobileNet and DistilBERT.
More recently, the concept has gained renewed attention with the rise of large language models and test-time compute scaling, where models are deliberately given more computation at inference time, through techniques like chain-of-thought reasoning or search-based decoding, to improve output quality. This inverts the traditional goal of minimizing inference cost and has opened new research directions around how to allocate evaluation-time compute most effectively to maximize model performance.
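One simple way to spend extra evaluation-time compute is a self-consistency-style majority vote: sample several candidate answers and return the most common one. In the sketch below, `generate_answer` is a hypothetical stand-in for a sampled model call, not an API from any specific library.

```python
import collections

def best_of_n(generate_answer, prompt, n=16):
    """Trade n times the inference compute for a (typically) more reliable answer."""
    answers = [generate_answer(prompt) for _ in range(n)]  # n independent decodes
    counts = collections.Counter(answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer
```

How large to make n, and whether to spend the budget on longer reasoning chains, wider search, or more samples, is precisely the allocation question this line of research studies.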