A regularization technique that halts model training when validation performance begins degrading.
Early stopping is a regularization strategy that prevents overfitting by terminating the training process once a model's performance on a held-out validation set stops improving or begins to worsen. During training, most iterative learning algorithms, particularly those used for deep neural networks, will continue to reduce loss on the training data even after the model has begun memorizing noise and idiosyncrasies specific to that data. Early stopping detects this turning point by tracking a validation metric such as loss or accuracy across epochs and halting training when the metric fails to improve for a specified number of consecutive evaluations, a count commonly called the "patience" parameter.
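The patience logic can be sketched in a few lines of framework-agnostic Python. The class below, including its name, defaults, and the simulated loss values, is illustrative only and not drawn from any particular library:

```python
class EarlyStopping:
    """Signal a stop once the monitored loss fails to improve by at
    least min_delta for `patience` consecutive checks."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_checks = 0

    def step(self, val_loss):
        """Record one validation result; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # new best: reset the counter
            self.stale_checks = 0
        else:
            self.stale_checks += 1    # no meaningful improvement this check
        return self.stale_checks >= self.patience


# Simulated validation losses: steady improvement, then a noisy plateau.
stopper = EarlyStopping(patience=3, min_delta=1e-3)
for epoch, val_loss in enumerate([0.90, 0.70, 0.55, 0.50, 0.51, 0.50, 0.52, 0.53]):
    if stopper.step(val_loss):
        print(f"Stopping at epoch {epoch}; best validation loss was {stopper.best:.2f}")
        break
```

The min_delta margin keeps noise-level fluctuations in the validation metric from counting as improvement and resetting the patience counter.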
In practice, early stopping is implemented by saving model checkpoints throughout training and restoring the weights from the epoch that achieved the best validation performance, ensuring the final model reflects the point of best generalization rather than the arbitrary endpoint of training. The technique is model-agnostic and applies broadly across iterative learning methods, including feedforward networks, recurrent networks, and gradient boosting machines, where it caps the number of boosting rounds. It is often used in conjunction with other regularization methods such as dropout or weight decay.
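As a concrete example, Keras ships this checkpoint-and-restore behavior in its built-in EarlyStopping callback. The architecture and synthetic data in the sketch below are arbitrary placeholders chosen only to make the snippet self-contained:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 1,000 samples, 20 features, binary labels.
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # validation metric to track
    patience=10,                # epochs to wait without improvement
    min_delta=1e-4,             # smallest change that counts as improvement
    restore_best_weights=True,  # roll back to the best epoch's weights on stop
)

# Training halts early if val_loss plateaus; the returned model already
# holds the weights from its best validation epoch.
model.fit(x, y, validation_split=0.2, epochs=200,
          callbacks=[early_stop], verbose=0)
```

With restore_best_weights=True, the callback keeps an in-memory copy of the best weights and copies them back into the model when training halts, matching the checkpoint-and-restore pattern described above.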
Early stopping matters because it addresses a fundamental tension in supervised learning: a sufficiently flexible model trained long enough will fit its training data arbitrarily well, but this does not guarantee good performance on new examples. By treating the number of training iterations as a hyperparameter tuned implicitly through validation feedback, early stopping provides an automatic and computationally efficient mechanism for model selection. It also reduces training time and resource consumption by skipping unnecessary epochs once the model has effectively reached its best generalizing state.
The technique gained formal attention in the machine learning literature during the mid-1990s, with Lutz Prechelt's 1996 empirical study providing one of the most cited systematic analyses of early stopping criteria for neural networks. Its adoption accelerated dramatically with the deep learning renaissance of the 2010s, when training times for large networks made any mechanism that could reduce wasted computation especially valuable.