Envisioning is an emerging technology research institute and advisory.

2011 — 2026


Checkpoint

A saved snapshot of a model's parameters and state during training.

Year: 2015 · Generality: 695

A checkpoint is a serialized snapshot of a machine learning model's parameters, optimizer state, and training metadata captured at a specific point during the training process. By periodically saving this information to disk, practitioners can resume an interrupted training run from where it left off rather than starting over — a critical safeguard given that training large models can take days or weeks on expensive hardware. Checkpoints also enable early stopping strategies, where training is halted once validation performance plateaus and the best-performing snapshot is retrieved for deployment.
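The early-stopping pattern described above can be sketched in framework-agnostic Python. The `train_step` and `validate` callables are hypothetical stand-ins for a real training loop, used only to illustrate how the best snapshot is tracked and retrieved:

```python
import copy

def train_with_best_checkpoint(train_step, validate, max_epochs=50, patience=5):
    """Keep the best-performing snapshot; stop once validation plateaus.

    `train_step(epoch)` returns updated model parameters; `validate(params)`
    returns a validation loss. Both are placeholders for a real loop.
    """
    best_loss = float("inf")
    best_checkpoint = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        params = train_step(epoch)
        val_loss = validate(params)
        if val_loss < best_loss:
            best_loss = val_loss
            best_checkpoint = copy.deepcopy(params)  # snapshot the best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping: validation has stopped improving
    return best_checkpoint, best_loss
```

The deep copy matters: without it, later training steps would mutate the "best" snapshot in place.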

In practice, checkpoints are saved either at fixed intervals — every N epochs or every N gradient steps — or conditionally, whenever a monitored metric such as validation loss reaches a new minimum. Modern frameworks like PyTorch and TensorFlow provide built-in utilities for this: torch.save and ModelCheckpoint callbacks, respectively, handle the serialization of weights and optimizer states in a format that can be reloaded seamlessly. A full checkpoint typically stores model weights, optimizer momentum buffers, learning rate scheduler state, and the current epoch or step count — and, when exact reproducibility matters, random number generator state as well — so that resumed training closely matches an uninterrupted run.

Beyond fault tolerance, checkpoints serve several other practical roles. They allow researchers to evaluate model behavior at intermediate stages of training, enabling analysis of how representations evolve over time. In transfer learning workflows, publicly released checkpoints — such as those for BERT, GPT, or ResNet — serve as pretrained starting points that dramatically reduce the compute required to adapt a model to a new task. This has made checkpoint sharing a cornerstone of the open-source ML ecosystem, with repositories like Hugging Face's Model Hub hosting thousands of community-contributed snapshots.

The importance of checkpointing scales directly with model size and training cost. For large language models trained on thousands of GPUs over months, losing progress due to a hardware failure without checkpoints would be catastrophically expensive. As a result, production training pipelines typically implement redundant checkpoint storage across multiple locations, with fine-grained control over retention policies to balance storage costs against recovery granularity.
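A retention policy like the one mentioned above can be expressed as a small pruning rule. This is an illustrative policy, not any particular pipeline's implementation: keep the most recent few checkpoints for fine-grained recovery, plus periodic milestones for coarse history:

```python
def checkpoints_to_delete(saved_steps, keep_last=3, keep_every=1000):
    """Return checkpoint steps safe to delete under a simple retention policy.

    Keeps the most recent `keep_last` checkpoints and any checkpoint whose
    step number is a multiple of `keep_every`; everything else is pruned.
    """
    saved_steps = sorted(saved_steps)
    recent = set(saved_steps[-keep_last:])
    return [s for s in saved_steps
            if s not in recent and s % keep_every != 0]
```

Tuning `keep_last` against `keep_every` is exactly the storage-cost-versus-recovery-granularity trade-off the paragraph describes.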

Related

Persistency

Storing model states and learned behaviors so AI systems retain knowledge over time.

Generality: 591
Stop Conditions

Criteria that determine when a machine learning training process should terminate.

Generality: 575
Pretrained Model

A model trained on large data, reused or fine-tuned for new tasks.

Generality: 838
Early Stopping

A regularization technique that halts model training when validation performance begins degrading.

Generality: 794
Benchmark

A standardized test used to measure and compare AI model performance.

Generality: 796
Open Weights

Publicly released model parameters that enable transparency, reproducibility, and collaborative AI development.

Generality: 694