A resampling technique that estimates how well a model generalizes to unseen data.
Cross-validation is a foundational model evaluation technique in machine learning that estimates how well a trained model will perform on independent, unseen data. Rather than relying on a single train-test split—which can produce misleading results depending on how the data happens to be divided—cross-validation systematically rotates which portion of the data is held out for evaluation. The most widely used variant, k-fold cross-validation, partitions the dataset into k roughly equal-sized subsets, or folds. The model is trained k times, each time using a different fold as the validation set and the remaining k−1 folds as training data. Performance metrics are then averaged across all k runs, yielding a more stable and reliable estimate of generalization ability.
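The fold-rotation mechanics described above can be sketched in plain Python. This is a minimal stand-in for library routines such as scikit-learn's `KFold`; the function name `k_fold_indices` and the fixed seed are illustrative choices, not a standard API.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Indices are shuffled once, split into k roughly equal folds, and
    each fold serves as the validation set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, up front
    size, rem = divmod(n_samples, k)      # first `rem` folds get one extra sample
    folds, start = [], 0
    for i in range(k):
        stop = start + size + (1 if i < rem else 0)
        folds.append(indices[start:stop])
        start = stop
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, val_idx

# Usage: evaluate a model on each split, then average the k scores.
for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    pass  # fit on train_idx, score on val_idx, collect the score
```

Because every sample lands in exactly one validation fold, averaging the per-fold scores uses all of the data for evaluation while never scoring a model on data it was trained on.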
Beyond standard k-fold, several specialized variants exist to handle different data conditions. Stratified k-fold preserves the class distribution within each fold, making it essential for imbalanced classification problems. Leave-one-out cross-validation (LOOCV) is an extreme case where k equals the number of samples, useful when data is very scarce but computationally expensive. Time-series data requires walk-forward or rolling-window validation to respect temporal ordering and prevent data leakage from future observations into past training windows.
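The temporal constraint for time-series data can be illustrated with an expanding-window sketch, loosely mirroring what scikit-learn's `TimeSeriesSplit` provides. The function name and the equal fold sizing are assumptions made for brevity; the essential property is that every validation fold lies strictly after its training window.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for walk-forward validation.

    The training window expands over time; the validation fold always
    comes strictly after it, so no future data leaks into training.
    """
    fold_size = n_samples // (n_splits + 1)  # reserve one fold's worth for the first train set
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * fold_size))
        val_idx = list(range(i * fold_size,
                             min((i + 1) * fold_size, n_samples)))
        yield train_idx, val_idx

# Usage: with 10 time-ordered samples and 4 splits, each split trains on
# everything up to a cutoff and validates on the next block.
for train_idx, val_idx in walk_forward_splits(10, 4):
    pass  # fit on the past (train_idx), evaluate on the future (val_idx)
```

Shuffling, which is harmless for standard k-fold on i.i.d. data, would break this ordering and leak future information into the training windows.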
Cross-validation plays a critical role in the model selection and hyperparameter tuning pipeline. By comparing cross-validated scores across different model architectures or parameter settings, practitioners can make principled choices without overfitting their decisions to a fixed test set. It also provides diagnostic information: high variance across folds suggests the model is sensitive to the specific training data, while consistently poor scores across folds indicate underfitting. Nested cross-validation—where an outer loop estimates generalization error and an inner loop tunes hyperparameters—offers a nearly unbiased evaluation when both tasks must be performed on the same dataset.
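The two-loop structure of nested cross-validation can be sketched as follows. This is a simplified illustration, not a production implementation: `fit_score` is a hypothetical callable standing in for "train a model with this hyperparameter and return its validation score," and the contiguous (unshuffled) folds keep the helper short.

```python
from statistics import mean

def folds(items, k):
    """Split items into k contiguous folds; yield (train, val) pairs."""
    size, rem = divmod(len(items), k)
    parts, start = [], 0
    for i in range(k):
        stop = start + size + (1 if i < rem else 0)
        parts.append(items[start:stop])
        start = stop
    for i in range(k):
        train = [x for part in parts[:i] + parts[i + 1:] for x in part]
        yield train, parts[i]

def nested_cv(data, candidates, fit_score, outer_k=5, inner_k=3):
    """Estimate generalization while tuning on the same dataset.

    fit_score(param, train, val) is a hypothetical stand-in for training
    a model with `param` on `train` and scoring it on `val`.
    """
    outer_scores = []
    for outer_train, outer_val in folds(data, outer_k):
        # Inner loop: pick the hyperparameter with the best mean score,
        # using only outer_train -- outer_val is never touched here.
        best = max(
            candidates,
            key=lambda p: mean(fit_score(p, tr, va)
                               for tr, va in folds(outer_train, inner_k)),
        )
        # Outer loop: score the tuned choice on truly held-out data.
        outer_scores.append(fit_score(best, outer_train, outer_val))
    return mean(outer_scores)
```

The key design point is the separation of concerns: the inner loop is allowed to overfit its hyperparameter choice to `outer_train`, but the outer score is always computed on data that played no role in that choice.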
The practical importance of cross-validation grows when labeled data is limited, as it allows nearly all available examples to contribute to both training and evaluation. In modern deep learning, where datasets are often large and training is expensive, simpler held-out validation sets are common, but cross-validation remains the gold standard for tabular data, scientific applications, and any setting where reliable performance estimates are critical.