An ensemble of decision trees that improves accuracy and resists overfitting.
Random Forest is an ensemble learning method that builds a large collection of decision trees during training and aggregates their outputs into a final prediction. For classification tasks, the algorithm returns the majority vote across all trees; for regression, it returns the mean of the individual tree predictions. This strategy of training each tree on a resampled dataset and aggregating the results, known as bagging (short for bootstrap aggregating), dramatically reduces the variance that plagues individual decision trees without substantially increasing bias, yielding a model that generalizes far better to unseen data.
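To make the aggregation step concrete, here is a minimal NumPy sketch; the prediction arrays are invented for the example, and `tree_preds` simply stands in for the per-tree outputs a trained forest would produce.

```python
import numpy as np

# Hypothetical per-tree class predictions: one row per tree,
# one column per sample.
tree_preds = np.array([
    [0, 1, 1],   # tree 1
    [1, 1, 0],   # tree 2
    [1, 1, 1],   # tree 3
])

# Classification: majority vote across trees for each sample.
def majority_vote(preds):
    return np.array([np.bincount(col).argmax() for col in preds.T])

print(majority_vote(tree_preds))    # -> [1 1 1]

# Regression: the mean of the individual tree predictions.
tree_preds_reg = np.array([
    [2.0, 3.5],   # tree 1
    [2.4, 3.1],   # tree 2
    [1.9, 3.3],   # tree 3
])
print(tree_preds_reg.mean(axis=0))  # -> [2.1 3.3]
```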
The algorithm introduces randomness at two distinct levels to keep the constituent trees diverse and decorrelated. First, each tree is trained on a bootstrap sample: a sample of the training data drawn with replacement, conventionally the same size as the original training set. Second, at every node split, only a random subset of features is considered as candidates for the best split, rather than all available features. This feature subsampling is the key innovation that distinguishes Random Forest from simple bagging of decision trees; it prevents a single dominant feature from driving the top splits of every tree, which would leave the trees highly correlated, and instead forces the ensemble to explore a wider range of predictive patterns.
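The two sources of randomness can be sketched in a few lines of NumPy. This illustrates the sampling mechanics only, not a tree implementation; the sizes and the sqrt(n_features) rule are conventional defaults rather than requirements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
X = rng.normal(size=(n_samples, n_features))

# Randomness level 1: a bootstrap sample for one tree, drawn with
# replacement and conventionally the same size as the training set.
boot_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[boot_idx]

# Randomness level 2: at a given node split, only a random subset of
# features is evaluated (sqrt(n_features) is a common default for
# classification).
max_features = int(np.sqrt(n_features))
candidates = rng.choice(n_features, size=max_features, replace=False)

# A real split search would then score only these candidate columns,
# e.g. X_boot[:, candidates], instead of all ten features.
```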
Random Forest became a cornerstone of practical machine learning after Leo Breiman formalized and published the algorithm in 2001, demonstrating its superior accuracy and robustness across a wide range of benchmark tasks. It requires minimal hyperparameter tuning compared to many competing methods, handles high-dimensional data gracefully, and is naturally resistant to overfitting even as the number of trees grows. These properties made it one of the most widely adopted algorithms in applied data science before the deep learning era, and it remains highly competitive on structured tabular data today.
Beyond raw predictive performance, Random Forest provides a built-in mechanism for estimating feature importance by measuring how much each variable reduces impurity across all splits in all trees. This interpretability makes it valuable not just as a predictive tool but as an exploratory instrument for understanding which input variables carry the most predictive signal — a property that has made it popular in domains such as genomics, finance, and clinical medicine where understanding model decisions is as important as accuracy.
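As a hands-on illustration, scikit-learn exposes this impurity-based measure as the `feature_importances_` attribute of a fitted forest. The snippet below uses a synthetic dataset, so the dataset parameters and resulting importance values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, of which only 3 carry real signal.
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Normalized mean decrease in impurity, summed over all splits in
# all trees; the informative features should rank near the top.
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```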