The number of training examples processed together before updating model parameters.
Batch size is a fundamental hyperparameter in machine learning that determines how many training examples are fed through a model in a single forward and backward pass before the model's weights are updated. Rather than updating parameters after every individual sample (stochastic gradient descent) or after the entire dataset (full-batch gradient descent), most modern training pipelines use mini-batches — a middle ground that balances computational efficiency with the quality of gradient estimates. Common batch sizes range from 16 to 512 samples, though the optimal choice depends heavily on the task, model architecture, and available hardware.
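The mini-batch iteration described above can be sketched in plain Python. This is a minimal illustration, not any particular framework's API; the function name `iter_minibatches` and its parameters are hypothetical:

```python
import random

def iter_minibatches(dataset, batch_size, shuffle=True, seed=None):
    """Yield successive mini-batches of examples from a dataset.

    Each batch would be fed through one forward/backward pass before
    a parameter update. The last batch may be smaller than batch_size
    when the dataset size is not an exact multiple of it.
    """
    indices = list(range(len(dataset)))
    if shuffle:
        # Shuffling each epoch decorrelates batches from data order.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

# A dataset of 10 examples with batch_size=4 yields batches of size 4, 4, 2.
batches = list(iter_minibatches(list(range(10)), batch_size=4, shuffle=False))
```

Real training frameworks typically expose the same idea through a data-loading abstraction, often with an option to drop the final undersized batch.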
The mechanics of batch size are tightly coupled to gradient estimation. With a small batch, the gradient computed is a noisy approximation of the true gradient over the full dataset, which introduces stochasticity that can help the optimizer escape sharp local minima and often leads to better generalization. Larger batches produce smoother, more accurate gradient estimates and allow for greater parallelism on GPUs, but they tend to converge to sharper minima that generalize less well — a phenomenon extensively studied in the deep learning literature. This trade-off means that simply increasing batch size to maximize hardware utilization can silently degrade model quality.
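The noise-versus-batch-size effect can be demonstrated with a toy simulation. Here the "per-example gradients" are just scalar draws around a true mean, and the spread of batch-mean estimates stands in for gradient noise; the setup is illustrative, with the values and function name chosen for this sketch:

```python
import random
import statistics

def batch_mean_estimates(values, batch_size, n_trials, rng):
    """Estimate the full-dataset mean from random batches; the spread
    of these estimates mimics gradient noise at that batch size."""
    return [statistics.mean(rng.sample(values, batch_size))
            for _ in range(n_trials)]

rng = random.Random(0)
# Toy per-example "gradients": noisy scalars around a true mean of 0.5.
grads = [rng.gauss(0.5, 1.0) for _ in range(10_000)]

small = batch_mean_estimates(grads, batch_size=8, n_trials=500, rng=rng)
large = batch_mean_estimates(grads, batch_size=256, n_trials=500, rng=rng)

noise_small = statistics.stdev(small)  # noisy gradient estimates
noise_large = statistics.stdev(large)  # much smoother estimates
```

The standard deviation of a batch-mean estimate shrinks roughly as 1/sqrt(B), so the batch-256 estimates cluster far more tightly around the true gradient than the batch-8 ones — the smoothing that, per the paragraph above, can trade away some of the beneficial stochasticity.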
Batch size also interacts critically with the learning rate. A common heuristic, known as linear scaling, suggests increasing the learning rate in proportion to the batch size to maintain comparable training dynamics. Complementary techniques have emerged to manage this interaction: learning rate warmup eases optimization through the unstable early steps that large-batch, high-learning-rate training introduces, while gradient accumulation simulates a larger effective batch size when memory constraints cap how many examples fit in a single step.
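Gradient accumulation can be made concrete with a toy one-parameter model. The sketch below, using a hypothetical squared-error objective 0.5*(w - x)^2, shows that summing micro-batch gradients and averaging before a single update reproduces the large-batch step exactly (when micro-batches are equal-sized):

```python
def mean_grad(examples, w):
    """Gradient of the loss 0.5*(w - x)^2 averaged over a batch:
    d/dw = mean(w - x)."""
    return sum(w - x for x in examples) / len(examples)

def sgd_step(w, examples, lr):
    """One SGD update using the full batch at once."""
    return w - lr * mean_grad(examples, w)

def accumulated_step(w, micro_batches, lr):
    """Gradient accumulation: compute each micro-batch gradient,
    average the accumulated gradients, then apply one update."""
    total = sum(mean_grad(mb, w) for mb in micro_batches)
    return w - lr * (total / len(micro_batches))

data = [0.0, 1.0, 2.0, 3.0]
w0, lr = 5.0, 0.1
full = sgd_step(w0, data, lr)                       # batch of 4
accum = accumulated_step(w0, [data[:2], data[2:]], lr)  # 2 micro-batches of 2
# full and accum agree to floating-point precision.
```

This equivalence is why accumulation is the standard workaround when the desired effective batch size exceeds device memory: the optimizer sees the same averaged gradient, just computed in pieces.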
In practice, batch size selection is rarely treated in isolation. It influences training speed, memory consumption, gradient noise, and generalization, making it one of the first hyperparameters practitioners tune. With the rise of large-scale distributed training for foundation models, batch sizes have grown dramatically — sometimes into the millions of tokens — requiring sophisticated scheduling strategies to preserve convergence stability and downstream performance.