Envisioning is an emerging technology research institute and advisory.

2011 — 2026


Evaluation-Time Compute

Computational resources consumed when an AI model runs inference on new data.

Year: 2012 · Generality: 627

Evaluation-time compute refers to the computational resources expended when a trained AI model is actively used to generate predictions, classifications, or decisions on new inputs. This is distinct from training-time compute, which involves iterative parameter updates over large datasets. During inference, the model's weights are fixed, and the system performs a forward pass through the network to produce an output. The efficiency of this process determines how quickly and cheaply a model can respond in production environments.
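As a rough illustration of the distinction above, the sketch below runs a single forward pass through a tiny two-layer network with frozen weights and estimates its per-input cost. All sizes and names here are invented for illustration; they do not correspond to any particular model.

```python
import numpy as np

# Minimal sketch: inference as one forward pass through a small
# two-layer MLP whose weights are fixed (no gradients, no updates).
rng = np.random.default_rng(0)

d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.standard_normal((d_in, d_hidden))   # frozen weights
W2 = rng.standard_normal((d_hidden, d_out))  # frozen weights

def forward(x):
    """One inference call: a forward pass, nothing more."""
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

# Crude evaluation-time cost estimate: two matrix multiplies,
# ~2 * (d_in*d_hidden + d_hidden*d_out) FLOPs per input.
flops_per_input = 2 * (d_in * d_hidden + d_hidden * d_out)

x = rng.standard_normal(d_in)
y = forward(x)
print(y.shape, flops_per_input)  # (4,) 384
```

Training would repeat a pass like this millions of times while also computing gradients and updating `W1` and `W2`; evaluation-time compute is the cost of the forward pass alone.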

The mechanics of evaluation-time compute depend heavily on model architecture and deployment context. A large transformer model performing autoregressive text generation, for instance, must execute sequential forward passes for each output token, making latency a central concern. Convolutional networks used in image classification can often be parallelized more aggressively, but still face memory bandwidth and arithmetic throughput constraints. Techniques such as quantization, pruning, knowledge distillation, and operator fusion are commonly applied to reduce the computational footprint at inference time without substantially degrading accuracy.
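Of the techniques just listed, quantization is the easiest to sketch. Below is a hedged, simplified example of symmetric post-training int8 quantization of a weight matrix; real toolchains use per-channel scales, calibration data, and quantized kernels, none of which are shown here.

```python
import numpy as np

# Illustrative sketch: symmetric post-training int8 quantization.
# Storing weights as int8 cuts memory traffic 4x vs float32, a common
# lever for reducing evaluation-time compute.
rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64)).astype(np.float32)

scale = np.abs(W).max() / 127.0  # map the largest magnitude to the int8 range
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

def dequantize(w_q, s):
    """Recover an approximate float32 matrix from int8 codes."""
    return w_q.astype(np.float32) * s

# Rounding error is bounded by half a quantization step.
err = np.abs(W - dequantize(W_q, scale)).max()
assert err <= scale / 2 + 1e-6
```

The accuracy cost is the rounding error `err`; pruning and distillation trade accuracy for compute in analogous ways, by removing weights or training a smaller student model.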

The importance of evaluation-time compute has grown as AI models are deployed across an increasingly diverse range of hardware environments—from data center GPUs to mobile phones, embedded microcontrollers, and purpose-built inference accelerators. In latency-sensitive applications like autonomous driving, real-time translation, or fraud detection, the cost and speed of a single inference can be a hard engineering constraint. This has driven the development of specialized hardware such as Google's TPU inference chips and NVIDIA's TensorRT optimization stack, as well as model families explicitly designed for efficient deployment, including MobileNet and DistilBERT.

More recently, the concept has gained renewed attention with the rise of large language models and test-time compute scaling, where models are deliberately given more computation at inference time—through techniques like chain-of-thought reasoning or search-based decoding—to improve output quality. This represents a deliberate inversion of the traditional goal of minimizing inference cost, and has opened new research directions around how to allocate evaluation-time compute most effectively to maximize model performance.
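One of the simplest forms of this inversion is best-of-n sampling: spend n forward passes instead of one and keep the candidate a verifier scores highest. The sketch below uses stand-in `generate` and `score` functions invented for illustration; in practice these would be a language model and a reward model or verifier.

```python
import random

# Hedged sketch of best-of-n sampling, a basic test-time compute
# scaling strategy: more inference compute (larger n) buys a better
# expected best candidate.
random.seed(0)

def generate():
    """Stand-in for one sampled model completion."""
    return [random.random() for _ in range(5)]

def score(candidate):
    """Stand-in for a reward model / verifier."""
    return sum(candidate)

def best_of_n(candidates):
    """Keep the highest-scoring candidate among n samples."""
    return max(candidates, key=score)

samples = [generate() for _ in range(8)]   # n = 8 forward passes
best = best_of_n(samples)
assert all(score(best) >= score(c) for c in samples)
```

How to allocate a fixed inference budget (more samples, deeper search, longer chains of thought) is exactly the open question the paragraph above describes.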

Related

Compute
The processing power and hardware resources required to train and run AI models.
Generality: 875

Inference-Time Reasoning
A trained model's process of applying learned knowledge to generate outputs on new data.
Generality: 751

Compute Efficiency
How effectively a system converts computational resources into useful model performance.
Generality: 702

Training Compute
The total computational resources consumed while training a machine learning model.
Generality: 650

Inference Acceleration
Techniques and hardware that speed up neural network prediction without sacrificing accuracy.
Generality: 694

TTC (Test-Time Compute)
Allocating additional computational resources during inference to improve reasoning and output quality.
Generality: 689