An ensemble method that trains a meta-model on the outputs of multiple base models.
Stacking, short for stacked generalization, is an ensemble learning technique that combines the predictions of multiple base models by training a higher-level meta-model to synthesize their outputs into a final prediction. Rather than simply averaging or voting across models, stacking learns how to best weight and integrate each base model's contribution, allowing the meta-model to exploit complementary strengths across diverse learners. The base models can be any combination of algorithms — decision trees, support vector machines, neural networks, or others — making stacking highly flexible in practice.
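The idea above can be sketched with scikit-learn's built-in `StackingClassifier`. This is a minimal illustration, not a recipe: the toy dataset, the particular base models, and all hyperparameters are arbitrary assumptions chosen only to show the structure.

```python
# Minimal stacking sketch: two diverse base models combined by a
# logistic-regression meta-model. Dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Diverse base learners; their predictions become the meta-model's inputs.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# cv=5 builds the out-of-fold predictions internally to avoid leakage.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(f"test accuracy: {stack.score(X_test, y_test):.3f}")
```

Because the meta-model sees only the base models' predictions, swapping in different base learners requires no other changes to the pipeline.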
The training process typically uses a cross-validation scheme to prevent data leakage. The training data is split into folds; each base model is trained on all but one fold and generates predictions on the held-out fold, and repeating this across every fold yields out-of-fold predictions for each training example. These out-of-fold predictions are then assembled into a new feature matrix, which becomes the training data for the meta-model. At inference time, the base models (typically refit on the full training set) generate predictions on new data, and those predictions are passed to the meta-model to produce the final output. This careful construction ensures the meta-model is trained on predictions that reflect how the base models generalize, not just how they memorize.
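The out-of-fold construction can also be hand-rolled to make the data flow explicit. The sketch below assumes a toy binary-classification dataset and two arbitrary base models; the fold count and model choices are illustrative assumptions.

```python
# Hand-rolled out-of-fold (OOF) meta-feature construction for stacking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=1),
    LogisticRegression(max_iter=1000),
]

# One meta-feature column per base model, filled only with predictions
# made on folds that model did not see during training.
meta_X = np.zeros((len(X), len(base_models)))
for train_idx, hold_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        meta_X[hold_idx, j] = model.predict_proba(X[hold_idx])[:, 1]

# The meta-model trains on out-of-fold predictions, not raw features.
meta_model = LogisticRegression().fit(meta_X, y)

# Inference: refit base models on all training data, stack their
# predictions on new data, and pass them to the meta-model.
new_X, _ = make_classification(n_samples=5, n_features=10, random_state=2)
cols = [m.fit(X, y).predict_proba(new_X)[:, 1] for m in base_models]
final = meta_model.predict(np.column_stack(cols))
print(final)
```

Note that every entry of `meta_X` comes from a model that never trained on that row, which is exactly the leakage guard described above.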
Stacking matters because it often outperforms individual models and simpler ensemble methods like bagging or boosting, particularly when the base models are diverse and make different types of errors. It became a staple technique in machine learning competitions, where stacked ensembles frequently topped leaderboards on platforms like Kaggle. The method is especially powerful when combining models that capture different aspects of the data — for example, pairing a gradient boosting model strong on tabular structure with a neural network that captures nonlinear interactions.
The primary trade-offs are increased computational cost and complexity. Training and maintaining multiple models, along with a meta-model, requires more resources and careful engineering. Interpretability also suffers, as the final predictions emerge from a layered system of models rather than a single transparent function. Despite these challenges, stacking remains one of the most reliable tools for squeezing out predictive performance in high-stakes modeling tasks.