Introduction
Random Forests: an ensemble supervised learning algorithm that combines multiple decision trees. Purpose: enhance prediction accuracy and reduce overfitting. Used for classification and regression tasks. Mechanism: bagging, per-split feature randomness, and majority voting (classification) or averaging (regression).
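As a quick orientation, a minimal training-and-prediction sketch using scikit-learn's RandomForestClassifier; the dataset and parameter values here are illustrative, not prescriptive.

```python
# Minimal usage sketch: train a random forest classifier and evaluate it.
# Dataset and parameter values are illustrative, not prescriptive.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each split considers a random subset of features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```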
"Random forests correct for decision trees' habit of overfitting to their training set." -- Leo Breiman
Background and Origins
Decision Trees
Single-tree models: hierarchical binary splits on features. Strength: interpretability. Weakness: prone to high variance, overfitting.
Bagging Concept
Bootstrap aggregating (bagging): sampling with replacement to build multiple models. Reduces variance, stabilizes predictions.
Random Forest Development
Formalized by Leo Breiman (2001), building on Tin Kam Ho's random decision forests (1995). Innovation: random feature selection at split nodes. Balances the bias-variance tradeoff. Widely adopted in machine learning.
Algorithm Structure
Ensemble Composition
Forest: collection of N decision trees. Each tree trained on bootstrap sample. Predictions aggregated across trees.
Random Feature Selection
At each split, subset of features randomly selected. Split chosen from subset, not full feature set. Enhances tree diversity.
Tree Depth and Splitting
Trees grown to maximal depth or until minimum node size. No pruning in standard random forest.
Training Process
Bootstrap Sampling
Each bootstrap sample is the same size as the original dataset, drawn with replacement. Typically ~63% of the original samples appear in a given tree's sample.
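A short numerical check of the ~63% figure, assuming NumPy; the sample size is arbitrary.

```python
# Numerical check: a bootstrap sample of size n drawn with replacement
# contains roughly 63% of the original samples (1 - 1/e in the limit).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
indices = rng.integers(0, n, size=n)              # bootstrap sample of size n
unique_fraction = np.unique(indices).size / n
print(f"Unique samples in bootstrap: {unique_fraction:.1%}")  # ~63%
```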
Node Splitting
At each node: select m features at random. Evaluate splitting criteria (e.g., Gini impurity, entropy, MSE). Choose best split.
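A simplified sketch of one node split, assuming Gini impurity and thresholds taken at observed feature values; the helper names (gini, best_split) are illustrative, not from any library.

```python
# Sketch of one node split: pick m features at random, evaluate candidate
# thresholds with Gini impurity, keep the best. Simplified (no handling of
# ties or constant features).
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, m, rng):
    n, p = X.shape
    features = rng.choice(p, size=m, replace=False)   # random feature subset
    best = (None, None, np.inf)                       # (feature, threshold, score)
    for f in features:
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted Gini impurity of the two children.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best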
Stopping Criteria
Stop when node pure, minimal samples reached, or maximum depth met. No pruning performed.
Out-of-Bag Samples
Samples not included in a tree's bootstrap sample serve for internal validation. Enables unbiased error estimation.
Prediction Methods
Classification
Each tree votes for class label. Final prediction: majority vote across trees.
Regression
Each tree outputs numeric value. Final prediction: average of all tree outputs.
Probability Estimation
Classification probabilities computed via normalized vote counts per class.
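A sketch of vote-based aggregation using the individual trees inside a fitted scikit-learn forest. Note that scikit-learn's own predict_proba averages per-tree leaf probabilities, which coincides with vote counting when trees are grown to pure leaves; the dataset choice is illustrative.

```python
# Sketch of prediction aggregation: majority vote for classification and
# normalized vote counts as class probabilities.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Collect each tree's predicted class index for one sample.
sample = X[:1]
votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
counts = np.bincount(votes.astype(int), minlength=len(forest.classes_))

print("Majority vote:", forest.classes_[counts.argmax()])
print("Vote proportions:", counts / counts.sum())
print("predict_proba:   ", forest.predict_proba(sample)[0])
```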
Out-of-Bag Prediction
Predictions aggregated only from trees where sample was OOB. Provides cross-validated predictions.
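In scikit-learn this is exposed through oob_score_ and oob_decision_function_ when the forest is fit with oob_score=True; the dataset choice is illustrative.

```python
# OOB predictions: per-sample class probabilities aggregated only from trees
# for which the sample was out-of-bag.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)

print("OOB accuracy estimate:", forest.oob_score_)
print("OOB class probabilities for first sample:",
      forest.oob_decision_function_[0])
```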
Feature Importance
Mean Decrease Impurity (MDI)
Sum of impurity decreases weighted by probability of reaching node, averaged over trees. Indicates feature's contribution to splits.
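In scikit-learn, MDI importances are available as feature_importances_; a brief sketch with an illustrative dataset.

```python
# Mean decrease in impurity (MDI) importances; they sum to 1 across features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Top five features by MDI importance.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```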
Permutation Importance
Randomly permute feature values; measure increase in prediction error. Higher increase: more important feature.
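A sketch using sklearn.inspection.permutation_importance on held-out data; the split and repeat counts are illustrative.

```python
# Permutation importance: shuffle one feature at a time on held-out data and
# measure the drop in score.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=0)
print("Mean importance per feature:", result.importances_mean)
```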
Interpretability
Feature importance aids model understanding, feature selection, and dimensionality reduction.
Advantages and Limitations
Advantages
Robust to noise and resistant to overfitting. Handles high-dimensional data. Requires minimal preprocessing. Provides feature importance metrics.
Limitations
Less interpretable than single trees. Computationally expensive with large forests. Can struggle with highly imbalanced data.
Bias-Variance Tradeoff
Reduces variance via averaging; bias may increase slightly due to randomness.
Hyperparameter Tuning
Number of Trees (n_estimators)
More trees improve performance but increase computation. Diminishing returns after a certain forest size.
Number of Features to Consider (max_features)
Typical defaults: sqrt(p) for classification, p/3 for regression (p = total features). Controls tree correlation.
Tree Depth (max_depth)
Restricts overfitting, reduces complexity. Often left unlimited to allow full growth.
Minimum Sample Splits and Leaves
Controls granularity of splits. Larger values increase bias, reduce variance.
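A cross-validated grid search over the parameters above; the grid values are illustrative starting points, not recommendations.

```python
# Sketch of hyperparameter tuning with cross-validated grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", 0.5],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```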
Applications
Classification Tasks
Medical diagnosis, spam detection, image recognition, credit scoring.
Regression Tasks
House price prediction, stock price forecasting, environmental modeling.
Anomaly Detection
Identify outliers via proximity measures derived from forest structure.
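One common construction of the proximity measure uses apply() to obtain each sample's leaf index per tree: two samples are proximate if they share a leaf in many trees, and samples with low average proximity can be flagged as candidate outliers. A minimal sketch, with a purely illustrative "lowest five" rule.

```python
# Sketch: proximity(i, j) = fraction of trees in which samples i and j fall
# in the same leaf. Low average proximity suggests a potential outlier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = forest.apply(X)                      # shape (n_samples, n_trees)
same_leaf = (leaves[:, None, :] == leaves[None, :, :])
proximity = same_leaf.mean(axis=2)            # (n_samples, n_samples)

avg_proximity = proximity.mean(axis=1)
outliers = np.argsort(avg_proximity)[:5]      # five least "connected" samples
print("Candidate outliers:", outliers)
```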
Feature Selection
Rank features by importance to reduce dimensionality in pre-processing.
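A sketch using scikit-learn's SelectFromModel with forest importances; the median threshold is an illustrative choice.

```python
# Keep only features whose importance exceeds the median importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```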
Performance Comparison
Vs Decision Trees
Random forests outperform single trees by reducing overfitting and variance.
Vs Gradient Boosting
Random forests faster to train, less sensitive to hyperparameters. Gradient boosting often achieves higher accuracy but risks overfitting.
Vs Support Vector Machines
Random forests scale better to large feature sets and natively support multiclass tasks. SVMs require kernel and parameter tuning.
Empirical Results
| Algorithm | Accuracy (%) | Training Time (s) |
|---|---|---|
| Decision Tree | 85.4 | 0.3 |
| Random Forest | 91.2 | 1.5 |
| Gradient Boosting | 92.5 | 4.2 |
| SVM | 89.8 | 2.8 |
Implementation Details
Data Structures
Trees stored as linked nodes or arrays. Efficient split evaluation critical for performance.
Splitting Criteria
Classification: Gini impurity or entropy. Regression: mean squared error (MSE).
Parallelization
Trees trained independently; ideal for parallel and distributed computing environments.
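In scikit-learn, parallel tree construction is controlled by n_jobs; a brief sketch with an illustrative synthetic dataset.

```python
# Trees are independent, so training parallelizes across cores.
# n_jobs=-1 uses all available cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
forest.fit(X, y)
```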
Out-of-Bag Error Estimation
Calculates an unbiased error estimate from OOB samples, avoiding the need for a separate validation set.
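The pseudocode below summarizes the training procedure; as a concrete but deliberately minimal illustration, the same loop can be written with scikit-learn's DecisionTreeClassifier as the per-tree learner (random_forest_train is a hypothetical helper, not a library function).

```python
# Minimal, illustrative training loop: N bootstrap samples, one tree per
# sample, m random features considered at each split (via max_features).
# Prediction/aggregation is omitted here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_train(X, y, n_trees=100, m="sqrt", random_state=0):
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)               # bootstrap sample
        tree = DecisionTreeClassifier(max_features=m)  # m features per split
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest
```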
Algorithm RandomForestTrain(D, N, m):

    Input: dataset D, number of trees N, features considered per split m
    Output: forest F
    Initialize forest F = {}
    For i = 1 to N:
        Sample D_i with replacement from D
        Train decision tree T_i on D_i, choosing among m random features at each split
        Add T_i to F
    Return F

Recent Advances
Extremely Randomized Trees (ExtraTrees)
More randomization in split thresholds. Faster, sometimes better generalization.
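scikit-learn provides this variant as ExtraTreesClassifier and ExtraTreesRegressor; a brief sketch with an illustrative dataset.

```python
# ExtraTrees: same ensemble interface, but splits use random thresholds and,
# by default, the full training set rather than bootstrap samples.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_breast_cancer(return_X_y=True)
extra = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print("Training accuracy:", extra.score(X, y))
```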
Oblique Random Forests
Use linear combinations of features for splits. Capture complex patterns.
Hybrid Models
Combine random forests with deep learning, boosting for improved accuracy.
Scalability Improvements
Optimized implementations for large-scale data: streaming, distributed frameworks.
References
- Breiman, L., "Random Forests," Machine Learning, vol. 45, no. 1, 2001, pp. 5-32.
- Ho, T.K., "Random Decision Forests," Proceedings of the 3rd International Conference on Document Analysis and Recognition, 1995, pp. 278-282.
- Cutler, D.R., et al., "Random Forests for Classification in Ecology," Ecology, vol. 88, no. 11, 2007, pp. 2783-2792.
- Liaw, A., Wiener, M., "Classification and Regression by randomForest," R News, vol. 2, no. 3, 2002, pp. 18-22.
- Geurts, P., Ernst, D., Wehenkel, L., "Extremely Randomized Trees," Machine Learning, vol. 63, no. 1, 2006, pp. 3-42.