A saved snapshot of a model's parameters and state during training.
A checkpoint is a serialized snapshot of a machine learning model's parameters, optimizer state, and training metadata captured at a specific point during the training process. By periodically saving this information to disk, practitioners can resume an interrupted training run from where it left off rather than starting over — a critical safeguard given that training large models can take days or weeks on expensive hardware. Checkpoints also enable early stopping strategies, where training is halted once validation performance plateaus and the best-performing snapshot is retrieved for deployment.
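To make the early stopping pattern concrete, here is a minimal sketch in PyTorch: the best-performing weights are snapshotted whenever validation loss improves, and training halts once the loss has not improved for a fixed number of epochs. The toy model, data, and file name are placeholders, not part of any standard API.

```python
import copy
import torch
import torch.nn as nn

# Toy setup so the loop runs end to end; in practice these come from your own
# model, data pipeline, and evaluation code.
torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x_train, y_train = torch.randn(64, 10), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 10), torch.randn(32, 1)

best_val_loss = float("inf")
best_state = None
patience, stale_epochs = 5, 0

for epoch in range(100):
    # One gradient step per "epoch" on the toy data.
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    if val_loss < best_val_loss:
        # New best: snapshot the weights in memory and on disk.
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        torch.save(best_state, "best_model.pt")
        stale_epochs = 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break  # validation loss has plateaued: stop early

# Retrieve the best-performing snapshot rather than the final weights.
model.load_state_dict(best_state)
```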
In practice, checkpoints are saved either at fixed intervals (every N epochs or every N gradient steps) or conditionally, whenever a monitored metric such as validation loss reaches a new minimum. Modern frameworks like PyTorch and TensorFlow provide built-in utilities for this: torch.save and the ModelCheckpoint callback, respectively, handle the serialization of weights and optimizer state in a format that can be reloaded seamlessly. A full checkpoint typically stores model weights, optimizer momentum buffers, learning rate scheduler state, and the current epoch or step count, so that resumed training closely reproduces an uninterrupted run; exact bitwise reproduction additionally requires saving random number generator and data loader state.
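A minimal PyTorch sketch of this save-and-resume cycle is shown below. The toy model, optimizer, scheduler, and file path are placeholders; a real training loop would call save_checkpoint every N steps or epochs.

```python
import torch
import torch.nn as nn

# Placeholder training objects; substitute your own model, optimizer, and scheduler.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)


def save_checkpoint(path, epoch):
    # Bundle everything needed to resume: weights, optimizer momentum buffers,
    # scheduler state, and the current position in training.
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
        },
        path,
    )


def load_checkpoint(path):
    # Restore all saved state so training can pick up where it left off.
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
    return checkpoint["epoch"] + 1  # epoch to resume from


save_checkpoint("ckpt_epoch_005.pt", epoch=5)
start_epoch = load_checkpoint("ckpt_epoch_005.pt")
```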
Beyond fault tolerance, checkpoints serve several other practical roles. They allow researchers to evaluate model behavior at intermediate stages of training, enabling analysis of how representations evolve over time. In transfer learning workflows, publicly released checkpoints — such as those for BERT, GPT, or ResNet — serve as pretrained starting points that dramatically reduce the compute required to adapt a model to a new task. This has made checkpoint sharing a cornerstone of the open-source ML ecosystem, with repositories like Hugging Face's Model Hub hosting thousands of community-contributed snapshots.
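One common form of this workflow, sketched with the Hugging Face transformers library (assuming it is installed; the checkpoint name and label count are just examples), is to download a pretrained snapshot from the Model Hub and attach a new task-specific head for fine-tuning:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download a publicly released pretrained checkpoint from the Hugging Face Model Hub
# and reuse it as the starting point for a new classification task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # new task-specific head, initialized randomly
)

# Fine-tuning from this checkpoint typically needs far less compute than
# pretraining the encoder from scratch.
inputs = tokenizer("Checkpoints make transfer learning cheap.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```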
The importance of checkpointing scales directly with model size and training cost. For large language models trained on thousands of GPUs over months, losing progress due to a hardware failure without checkpoints would be catastrophically expensive. As a result, production training pipelines typically implement redundant checkpoint storage across multiple locations, with fine-grained control over retention policies to balance storage costs against recovery granularity.
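One way such a retention policy might look is sketched below; the file naming scheme and the keep_last/keep_every thresholds are assumptions for illustration, not a standard convention.

```python
from pathlib import Path


def prune_checkpoints(ckpt_dir, keep_last=3, keep_every=10_000):
    """Illustrative retention policy: keep the most recent `keep_last` checkpoints
    for fine-grained recovery, plus every `keep_every`-th step as a long-term
    milestone, and delete the rest to bound storage costs.

    Assumes files are named like 'ckpt_step_000123.pt' so they sort by step.
    """
    ckpts = sorted(Path(ckpt_dir).glob("ckpt_step_*.pt"))
    recent = set(ckpts[-keep_last:])
    for path in ckpts:
        step = int(path.stem.split("_")[-1])
        if path in recent or step % keep_every == 0:
            continue
        path.unlink()  # neither recent nor a milestone: reclaim the storage
```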