A technique that dynamically groups incoming requests into batches for efficient ML inference.
Continuous batching is a systems-level scheduling technique used in machine learning inference and training pipelines that dynamically assembles incoming requests or examples into batches as they arrive, rather than waiting for a fixed batch to fill or processing each item individually. Unlike static batching—where a server waits until a predetermined number of requests accumulates before processing—continuous batching inserts new requests into an ongoing computation cycle as soon as capacity becomes available. This allows the system to maintain high hardware utilization on GPUs or TPUs while keeping individual request latency bounded, striking a practical balance between throughput and responsiveness.
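For contrast, here is a minimal sketch of the static approach described above; the `submit`, `run_model`, and `BATCH_SIZE` names are illustrative and not drawn from any particular serving framework.

```python
# Static batching (illustrative sketch): nothing runs until a full batch
# has accumulated, so under light or uneven load requests sit in the queue.
from collections import deque

BATCH_SIZE = 8
pending = deque()

def run_model(batch):
    # Stand-in for a single forward pass over the whole batch.
    print(f"processing {len(batch)} requests together")

def submit(request):
    pending.append(request)
    if len(pending) >= BATCH_SIZE:
        batch = [pending.popleft() for _ in range(BATCH_SIZE)]
        run_model(batch)

for i in range(10):
    submit(f"request-{i}")   # one batch of 8 fires; the last 2 stay queued
```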
The mechanism works by maintaining a queue of pending inputs and a scheduler that monitors the state of in-flight computations. When a slot opens—such as when one sequence in a batch finishes generating tokens in a large language model—the scheduler immediately fills that slot with a waiting request rather than idling until the entire batch completes. This is especially valuable for autoregressive generation, where different sequences finish at different steps: under naive static batching, slots held by finished sequences sit idle until the longest sequence in the batch completes. Frameworks such as NVIDIA's Triton Inference Server and vLLM popularized this approach for LLM serving, where it is also known as iteration-level scheduling or in-flight batching.
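A minimal sketch of such an iteration-level scheduling loop follows, assuming one token is decoded per sequence per iteration; `Request`, `decode_step`, `MAX_BATCH`, and `step` are hypothetical names rather than any serving framework's API.

```python
# Continuous (iteration-level) batching sketch for autoregressive decoding.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 4  # number of sequences the accelerator decodes per iteration

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

waiting = deque()   # requests not yet admitted to the batch
running = []        # sequences currently being decoded

def decode_step(batch):
    # Stand-in for one forward pass producing one new token per sequence.
    return ["tok"] * len(batch)

def step():
    # 1. Refill free slots from the waiting queue before every iteration,
    #    instead of waiting for the whole batch to drain (static batching).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    if not running:
        return
    # 2. Run one decoding iteration over all in-flight sequences.
    for req, tok in zip(running, decode_step(running)):
        req.generated.append(tok)
    # 3. Retire finished sequences immediately; their slots become
    #    available on the very next iteration.
    running[:] = [r for r in running if len(r.generated) < r.max_new_tokens]

# Requests with different output lengths arrive; the loop keeps slots busy.
for i, n in enumerate([3, 1, 5, 2, 4, 2]):
    waiting.append(Request(prompt=f"p{i}", max_new_tokens=n))
iterations = 0
while waiting or running:
    step()
    iterations += 1
print(f"served 6 requests in {iterations} decode iterations")
```

The key point is step 1: admission happens at token-iteration granularity, so a slot vacated by a short sequence is reused immediately rather than at the next batch boundary.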
Continuous batching matters because accelerator utilization is one of the primary cost drivers in production ML systems. Static batching can leave GPUs underutilized when requests arrive unevenly, while purely online, one-request-at-a-time processing forgoes the parallelism that makes accelerators efficient in the first place. By keeping hardware continuously occupied with useful work, continuous batching can improve throughput by an order of magnitude for variable-length workloads compared to naive alternatives.
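A toy back-of-envelope comparison makes the utilization gap concrete; the sequence lengths and batch size below are invented for illustration only, not benchmark results.

```python
# Count slot-iterations under static vs. ideal continuous batching for
# variable-length outputs (numbers are illustrative, not a benchmark).
lengths = [12, 3, 45, 7, 30, 5, 60, 9]   # tokens generated per request
batch_size = 4

# Static batching: each batch runs until its longest sequence finishes,
# so slots of shorter sequences remain occupied but do no useful work.
static_iters = sum(max(lengths[i:i + batch_size])
                   for i in range(0, len(lengths), batch_size))
static_slot_steps = static_iters * batch_size

# Ideal continuous batching with a full backlog: a slot is reclaimed the
# moment its sequence finishes, so slot-steps approach the useful work.
useful_steps = sum(lengths)

print(f"useful token steps:          {useful_steps}")       # 171
print(f"static slot-steps consumed:  {static_slot_steps}")   # 420
print(f"static slot utilization:     {useful_steps / static_slot_steps:.0%}")  # ~41%
```

With these made-up numbers, static batching keeps slots doing useful work only about 41% of the time, because every batch runs until its longest sequence completes.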
Engineering continuous batching in practice requires careful attention to memory management, backpressure handling, and latency service-level objectives. Techniques such as PagedAttention (introduced in vLLM) complement continuous batching by managing GPU memory more flexibly across variable-length sequences. The approach also intersects with streaming training and continual learning pipelines, where micro-batching policies must account for non-IID data, concept drift, and gradient staleness in asynchronous update settings.
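As a rough illustration of the memory-management and backpressure side, the sketch below reserves worst-case KV-cache blocks before admitting a request; the block size, pool size, and function names are invented for this example and are not vLLM's actual interfaces.

```python
# Hypothetical memory-aware admission control for a continuous-batching
# scheduler; all constants and names are illustrative assumptions.
BLOCK_TOKENS = 16      # tokens stored per KV-cache block (assumed)
TOTAL_BLOCKS = 1024    # blocks available on the accelerator (assumed)
free_blocks = TOTAL_BLOCKS

def blocks_needed(prompt_tokens: int, max_new_tokens: int) -> int:
    # Ceiling division: worst-case footprint if the request generates
    # its full token budget.
    return -(-(prompt_tokens + max_new_tokens) // BLOCK_TOKENS)

def try_admit(prompt_tokens: int, max_new_tokens: int):
    """Reserve blocks up front; return the reservation size, or None to
    signal backpressure (the request stays in the waiting queue)."""
    global free_blocks
    need = blocks_needed(prompt_tokens, max_new_tokens)
    if need > free_blocks:
        return None
    free_blocks -= need
    return need

def release(reserved_blocks: int) -> None:
    """Return a finished request's reserved blocks to the pool."""
    global free_blocks
    free_blocks += reserved_blocks

reservation = try_admit(prompt_tokens=512, max_new_tokens=256)  # 48 blocks
if reservation is not None:
    # ... decode to completion, then free the memory for queued requests ...
    release(reservation)
```

Reserving the worst case up front is the simplest policy: it trades some utilization for a guarantee that admitted requests never exhaust the cache mid-generation.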