
Batch Size

The number of training examples processed together before updating model parameters.

Year: 1986
Generality: 796

Batch size is a fundamental hyperparameter in machine learning that determines how many training examples are fed through a model in a single forward and backward pass before the model's weights are updated. Rather than updating parameters after every individual sample (stochastic gradient descent) or after the entire dataset (full-batch gradient descent), most modern training pipelines use mini-batches — a middle ground that balances computational efficiency with the quality of gradient estimates. Common batch sizes range from 16 to 512 samples, though the optimal choice depends heavily on the task, model architecture, and available hardware.
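As a concrete illustration, a minimal mini-batch training loop might look like the following PyTorch sketch. The toy data, the linear model, and the batch size of 64 are illustrative assumptions, not a setup prescribed by this entry.

```python
# Minimal mini-batch training sketch (illustrative values throughout).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data: 1,000 samples with 20 features each.
X = torch.randn(1000, 20)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

batch_size = 64  # the hyperparameter under discussion
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:              # each iteration sees one mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)  # forward pass over the batch
        loss.backward()                # backward pass: gradient averaged over the batch
        optimizer.step()               # one parameter update per batch
```

Setting batch_size to 1 would recover per-sample (stochastic) updates, while setting it to len(dataset) would recover full-batch gradient descent.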

The mechanics of batch size are tightly coupled to gradient estimation. With a small batch, the gradient computed is a noisy approximation of the true gradient over the full dataset, which introduces stochasticity that can help the optimizer escape sharp local minima and often leads to better generalization. Larger batches produce smoother, more accurate gradient estimates and allow for greater parallelism on GPUs, but they tend to converge to sharper minima that generalize less well — a phenomenon extensively studied in the deep learning literature. This trade-off means that simply increasing batch size to maximize hardware utilization can silently degrade model quality.
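To make the gradient-noise effect tangible, the NumPy sketch below (an assumed illustration, not from the source) compares mini-batch gradient estimates at several batch sizes against the full-batch gradient of a simple least-squares loss; the deviation shrinks as the batch grows.

```python
# Rough illustration of gradient noise shrinking as batch size grows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.1, size=10_000)

w = np.zeros(5)  # current parameters

def gradient(idx):
    """Gradient of mean squared error over the samples in idx."""
    err = X[idx] @ w - y[idx]
    return X[idx].T @ err / len(idx)

full_grad = gradient(np.arange(len(X)))  # "true" full-batch gradient

for batch_size in (8, 64, 512):
    # Sample many mini-batch gradients and measure their spread around the full gradient.
    devs = [np.linalg.norm(gradient(rng.choice(len(X), batch_size, replace=False)) - full_grad)
            for _ in range(200)]
    print(f"batch size {batch_size:4d}: mean deviation from full gradient {np.mean(devs):.4f}")
```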

Batch size also interacts critically with the learning rate. A common heuristic, sometimes called linear scaling, suggests increasing the learning rate proportionally when batch size is increased, to maintain comparable training dynamics. Techniques like learning rate warmup and gradient accumulation have emerged specifically to manage these interactions, particularly when training very large models where memory constraints force small effective batch sizes.
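The sketch below shows, under assumed values, how the linear scaling heuristic and gradient accumulation might be combined in PyTorch. The reference batch size of 256, the micro-batch size of 64, and the target effective batch size of 1,024 are hypothetical.

```python
# Hedged sketch: linear learning-rate scaling plus gradient accumulation.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(20, 1)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(1000, 20), torch.randn(1000, 1)),
                    batch_size=64, shuffle=True)  # micro-batches of 64 fit in memory

base_lr, base_batch_size = 0.1, 256   # learning rate tuned at a reference batch size
effective_batch_size = 1024

# Linear scaling heuristic: grow the learning rate proportionally with batch size.
scaled_lr = base_lr * effective_batch_size / base_batch_size  # -> 0.4
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)

# Gradient accumulation: sum gradients over several micro-batches, then update once.
accumulation_steps = effective_batch_size // 64               # -> 16
optimizer.zero_grad()
for step, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb) / accumulation_steps  # average over the effective batch
    loss.backward()                                     # gradients add up in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                # one update per effective batch
        optimizer.zero_grad()
```

Dividing the loss by accumulation_steps keeps the accumulated gradient equal to the average over the effective batch, so the scaled learning rate behaves roughly as it would with a true large batch.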

In practice, batch size selection is rarely treated in isolation. It influences training speed, memory consumption, gradient noise, and generalization, making it one of the first hyperparameters practitioners tune. With the rise of large-scale distributed training for foundation models, batch sizes have grown dramatically — sometimes into the millions of tokens — requiring sophisticated scheduling strategies to preserve convergence stability and downstream performance.

Related

Batch

A fixed subset of training data processed together in one forward-backward pass.

Generality: 796
Batch Inference

Running a trained model on many inputs simultaneously to generate predictions efficiently.

Generality: 694
Continuous Batching

A technique that dynamically groups incoming requests into batches for efficient ML inference.

Generality: 339
Batch Normalization

A technique that normalizes layer inputs to accelerate and stabilize neural network training.

Generality: 794
Step Size

A hyperparameter controlling how large each parameter update is during optimization.

Generality: 720
Parameter Size

The total count of learnable weights and biases in a machine learning model.

Generality: 694