Continuous batching

A streaming-friendly approach that continuously accumulates incoming examples into small, regularly emitted micro-batches, trading a bounded amount of latency for higher throughput and lower per-item overhead than pure online processing in ML training and inference systems.

Continuous batching is a systems-level strategy used in ML (Machine Learning) pipelines and serving stacks to reconcile the competing needs of low-latency responsiveness and high hardware utilization. Instead of processing each example individually (as in online learning) or waiting to assemble very large static batches, a continuous-batching system buffers inputs into micro-batches based on size, time window, or adaptive policies, then processes those micro-batches on accelerators (GPUs/TPUs) or across distributed workers. This improves I/O and kernel efficiency, reduces framework overhead, and can produce a more stable gradient signal (or higher amortized throughput for inference) while keeping end-to-end latency acceptable for production SLOs.

From an optimization perspective, continuous batching changes the effective batch size, interacts with learning-rate scaling rules, and can introduce gradient staleness in asynchronous setups. All of these require careful tuning (gradient accumulation, optimizer scheduling, normalization choices such as LayerNorm or GroupNorm instead of BatchNorm, and bounded-staleness protocols) to preserve convergence and generalization.

In practice it appears in streaming training and continual learning, real-time personalization and recommendation systems, and production inference pipelines (micro-batching in serving frameworks and stream processors). Engineering considerations include backpressure handling, adaptive batch sizing to meet latency SLOs, sample prioritization, and correctness for stateful preprocessing under non-IID data and concept drift.
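A minimal sketch of the core buffering policy, in Python, is shown below. It is illustrative only: the ContinuousBatcher class, the max_batch_size and max_wait_s parameters, and the process_batch callback are hypothetical names, not the API of any particular serving framework. A background worker flushes a micro-batch either when it reaches the target size or when the oldest buffered item has waited out the time window, which is what bounds per-item latency while amortizing per-batch overhead.

import queue
import threading
import time
from typing import Any, Callable, List

class ContinuousBatcher:
    """Buffers submitted items and emits micro-batches on size or time triggers."""

    def __init__(self,
                 process_batch: Callable[[List[Any]], None],
                 max_batch_size: int = 32,
                 max_wait_s: float = 0.01):
        self.process_batch = process_batch      # e.g., a forward pass or a training step
        self.max_batch_size = max_batch_size    # size trigger
        self.max_wait_s = max_wait_s            # time-window trigger (latency budget)
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item: Any) -> None:
        # Non-blocking hand-off from producers into the buffer.
        self._queue.put(item)

    def _run(self) -> None:
        while not self._stop.is_set():
            batch: List[Any] = []
            # Wait briefly for the first item, then start the latency clock.
            try:
                batch.append(self._queue.get(timeout=0.1))
            except queue.Empty:
                continue
            deadline = time.monotonic() + self.max_wait_s
            # Keep filling until the batch is full or the time window closes.
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            self.process_batch(batch)  # flush: size or time trigger fired

    def close(self) -> None:
        self._stop.set()
        self._worker.join()

For example, wiring the batcher to a stub handler and feeding it a trickle of requests:

batcher = ContinuousBatcher(process_batch=lambda b: print(f"flushed {len(b)} items"))
for i in range(100):
    batcher.submit({"request_id": i})
    time.sleep(0.001)   # simulate a steady stream of arrivals
time.sleep(0.05)        # let the final time window expire
batcher.close()

In a real system the same loop is typically extended with adaptive policies, for example shrinking max_batch_size when measured flush latency approaches the SLO, or applying backpressure to producers when queue depth grows.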

The pattern first saw common use in the early 2010s with micro-batching in stream-processing systems (e.g., Spark Streaming, ~2013); it gained broader popularity in the mid-to-late 2010s and into the early 2020s as production ML systems and real-time serving and training needs pushed teams to balance latency against accelerator efficiency.