Envisioning is an emerging technology research institute and advisory.


Continuous Batching

A technique that dynamically groups incoming requests into batches for efficient ML inference.

Year: 2022 · Generality: 339

Continuous batching is a systems-level scheduling technique used in machine learning inference and training pipelines that dynamically assembles incoming requests or examples into batches as they arrive, rather than waiting for a fixed batch to fill or processing each item individually. Unlike static batching—where a server waits until a predetermined number of requests accumulates before processing—continuous batching inserts new requests into an ongoing computation cycle as soon as capacity becomes available. This allows the system to maintain high hardware utilization on GPUs or TPUs while keeping individual request latency bounded, striking a practical balance between throughput and responsiveness.

The mechanism works by maintaining a queue of pending inputs and a scheduler that monitors the state of in-flight computations. When a slot opens—such as when one sequence in a batch finishes generating tokens in a large language model—the scheduler immediately fills that slot with a waiting request rather than idling until the entire batch completes. This is especially impactful for autoregressive generation tasks, where different sequences finish at different steps, causing significant wasted compute under naive static batching. Frameworks like NVIDIA's Triton Inference Server and vLLM popularized this approach for LLM serving, where it is sometimes called iteration-level scheduling or in-flight batching.
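
The slot-refilling loop described above can be sketched as a toy simulation (the function name and structure are illustrative, not any framework's actual API): each iteration decodes one token for every in-flight sequence, and a sequence that finishes frees its slot, which is refilled immediately from the pending queue.

```python
from collections import deque

def continuous_batch_decode(request_lengths, batch_slots):
    """Toy simulation of iteration-level scheduling.

    Each step decodes one token for every in-flight sequence; a finished
    sequence frees its slot at iteration granularity, and the scheduler
    refills it from the pending queue before the next step.
    """
    pending = deque(request_lengths)   # tokens left to generate, per request
    in_flight = []                     # remaining-token counts for active slots
    steps = 0
    while pending or in_flight:
        # Fill any open slots before the next decode iteration.
        while pending and len(in_flight) < batch_slots:
            in_flight.append(pending.popleft())
        # One decode iteration: every active sequence emits one token.
        in_flight = [n - 1 for n in in_flight]
        # Completed sequences leave the batch immediately.
        in_flight = [n for n in in_flight if n > 0]
        steps += 1
    return steps
```

With two slots and requests needing 3, 1, and 2 tokens, the simulation finishes in 3 iterations (6 tokens across 2 slots), the theoretical minimum; a static batcher would have held the second slot idle while the 3-token sequence finished.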

Continuous batching matters because accelerator utilization is one of the primary cost drivers in production ML systems. Static batching can leave GPUs underutilized when requests arrive unevenly, while purely online processing wastes the parallelism that makes accelerators efficient in the first place. By keeping hardware continuously occupied with useful work, continuous batching can improve throughput by an order of magnitude for variable-length workloads compared to naive alternatives.
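
A back-of-envelope calculation illustrates the utilization gap (the helper below is hypothetical, assuming one token per sequence per iteration): under static batching, every slot in a batch stays occupied until the batch's longest sequence finishes, so mixing long and short sequences wastes most slot-iterations.

```python
def static_utilization(lengths, batch_size):
    """Fraction of slot-iterations doing useful work under static batching.

    Each batch runs for max(lengths in batch) iterations; slots whose
    sequence finished early sit idle until the whole batch completes.
    """
    useful = sum(lengths)              # one unit of work per generated token
    occupied = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        occupied += max(chunk) * len(chunk)
    return useful / occupied
```

For four requests of lengths [100, 5, 100, 5] processed in batches of two, static batching keeps only 52.5% of slot-iterations busy; continuous batching refills freed slots immediately, pushing utilization toward 100% on the same workload.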

Engineering continuous batching in practice requires careful attention to memory management, backpressure handling, and latency service-level objectives. Techniques like paged attention (used in vLLM) complement continuous batching by managing GPU memory more flexibly across variable-length sequences. The approach also intersects with streaming training and continual learning pipelines, where micro-batching policies must account for non-IID data, concept drift, and gradient staleness in asynchronous update settings.
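
The interplay with paged memory management can be sketched with a toy block-table allocator (class and method names are hypothetical, loosely inspired by vLLM's PagedAttention rather than reproducing its API): KV-cache memory is carved into fixed-size blocks, each sequence maps its token positions onto whatever blocks are free, and a finished sequence's blocks return to the pool for immediate reuse by newly admitted requests.

```python
class PagedKVCache:
    """Toy bookkeeping for a paged KV cache (names are illustrative).

    Sequences hold lists of fixed-size block IDs instead of one contiguous
    region, so blocks freed by a finished sequence are reusable by any
    sequence the continuous batcher admits next.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of unused block IDs
        self.tables = {}                     # seq_id -> list of block IDs

    def append_token(self, seq_id, pos):
        # Allocate a new block only when the sequence crosses a block boundary.
        if pos % self.block_size == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        # A finished sequence returns its blocks to the pool for reuse.
        self.free.extend(self.tables.pop(seq_id, []))
```

Because allocation happens per block rather than per worst-case sequence length, admitting a new request into a freed slot never requires reserving a contiguous maximum-length buffer up front.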

Related

Batch Inference
Running a trained model on many inputs simultaneously to generate predictions efficiently.
Generality: 694

Batch
A fixed subset of training data processed together in one forward-backward pass.
Generality: 796

Batch Size
The number of training examples processed together before updating model parameters.
Generality: 796

Streaming
Real-time, token-by-token delivery of model outputs as they are generated.
Generality: 450

Continuous Learning
AI systems that incrementally learn from new data without forgetting prior knowledge.
Generality: 713

Batch Normalization
A technique that normalizes layer inputs to accelerate and stabilize neural network training.
Generality: 794