Overview

Definition: Decision trees are flowchart-like structures used for supervised learning tasks. Purpose: model decision rules to classify or predict output variables. Input: feature vectors with labeled outcomes. Output: tree of decisions leading to predictions. Characteristics: interpretable, non-parametric, handle categorical and continuous data. Use cases: classification, regression, feature selection, anomaly detection.

"Decision trees are one of the most intuitive and widely used models in machine learning, combining simplicity with power." -- J. Quinlan

Historical Context

Early development: 1960s, basic classification trees. Key contributors: Ross Quinlan (ID3, C4.5), Breiman et al. (CART). Evolution: from simple trees to ensemble methods (Random Forests, Gradient Boosted Trees).

Supervised Learning Context

Role: map inputs to outputs using labeled data. Learning: recursive partitioning to minimize impurity. Goal: maximize predictive accuracy while maintaining interpretability.

Key Concepts

Nodes: decision points or leaves. Branches: outcomes of splits. Root: start of the tree. Leaves: terminal nodes with prediction values.

Tree Structure

Nodes and Branches

Node types: internal (decision nodes), leaf (terminal nodes). Branches: connect nodes, represent feature test outcomes. Tree traversal: from root to leaves along branches.

Root Node

Definition: topmost node representing entire dataset. Function: initial feature split based on highest information gain or impurity reduction.

Internal Nodes

Function: test a feature condition (e.g., attribute ≤ threshold). Each internal node splits data into subsets.

Leaf Nodes

Represent final prediction. Classification: majority class. Regression: mean target value.

Depth and Size

Depth: longest path from root to leaf. Size: total nodes. Trade-off: deeper trees fit data better but risk overfitting.

Component | Description
Root Node | Starting point; represents the entire dataset
Internal Nodes | Feature tests that split the data
Branches | Outcomes of tests; connect nodes
Leaf Nodes | Final prediction or class label

Construction Algorithms

ID3 Algorithm

Inventor: Ross Quinlan (1986). Approach: top-down, greedy, uses information gain for split selection. Limitations: handles only categorical features, prone to overfitting.

C4.5 Algorithm

Extension of ID3. Supports continuous features and missing values. Uses gain ratio to avoid bias towards features with many values. Includes pruning to mitigate overfitting.

CART Algorithm

Classification and Regression Trees by Breiman et al. Uses Gini impurity for classification splits. Binary splits only. Supports regression by minimizing squared error.

Recursive Partitioning

Mechanism: split data based on best feature criterion. Recursion: build subtrees on subsets until stopping condition met.

Stopping Criteria

Minimum samples per leaf. Maximum depth. No further impurity reduction. All samples belong to one class.

function buildTree(data):
    if stopping_condition(data):
        return leaf_node(data)
    best_feature = select_best_feature(data)
    splits = partition_data(data, best_feature)
    node = create_node(best_feature)
    for subset in splits:
        child = buildTree(subset)
        node.add_branch(child)
    return node

Splitting Criteria

Information Gain

Definition: reduction in entropy achieved by a split. Formula: IG = Entropy(parent) − Σᵥ (|Sᵥ|/|S|) Entropy(Sᵥ), where the sum runs over the child subsets Sᵥ. Preference: features that maximally reduce uncertainty.

Entropy

Measure of disorder or impurity. Formula for binary classification:

Entropy(S) = -p₁ log₂ p₁ - p₂ log₂ p₂

Gini Impurity

Alternative metric used in CART. Measures probability of misclassification. Formula:

Gini(S) = 1 - Σ pᵢ²

Gain Ratio

Normalized information gain. Addresses bias towards multi-valued features. Formula:

GainRatio = InformationGain / SplitInfo, where SplitInfo = −Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|) is the entropy of the split itself.

Mean Squared Error (MSE)

Used in regression trees. Minimizes variance within splits. Formula:

MSE = (1/n) Σ(yᵢ - ȳ)²

Criterion | Type | Purpose
Information Gain | Classification | Selects splits that reduce entropy
Gini Impurity | Classification | Minimizes misclassification probability
Gain Ratio | Classification | Normalizes information gain bias
Mean Squared Error | Regression | Minimizes variance within nodes
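
As a concrete illustration of the criteria above, the sketch below computes entropy, Gini impurity, and information gain for a hypothetical binary split; the class counts and helper names are made up for the example.

import numpy as np

def entropy(labels):
    # Entropy(S) = -Σ pᵢ log₂ pᵢ over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - Σ pᵢ²
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, left_labels, right_labels):
    # IG = Entropy(parent) - weighted sum of child entropies
    n = len(parent_labels)
    weighted = ((len(left_labels) / n) * entropy(left_labels)
                + (len(right_labels) / n) * entropy(right_labels))
    return entropy(parent_labels) - weighted

# Example: a parent node with 6 positives and 4 negatives,
# split into a pure left child (4+, 0-) and a mixed right child (2+, 4-)
parent = np.array([1] * 6 + [0] * 4)
left = np.array([1] * 4)
right = np.array([1] * 2 + [0] * 4)
print(entropy(parent), gini(parent), information_gain(parent, left, right))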

Pruning Methods

Purpose of Pruning

Reduce overfitting. Simplify tree structure. Improve generalization to unseen data.

Pre-Pruning (Early Stopping)

Stops tree growth early based on criteria: minimum samples, maximum depth, minimum impurity decrease.
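
In scikit-learn (used again below for the basic classifier), pre-pruning corresponds to growth-limiting constructor parameters; the parameter values and dataset here are illustrative, not recommendations.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Growth stops early when any of these pre-pruning constraints is hit
clf = DecisionTreeClassifier(
    max_depth=4,                 # cap on tree depth
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least this much
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())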

Post-Pruning (Reduced Error Pruning)

Build full tree first. Then remove branches that do not reduce validation error. Uses hold-out or cross-validation data.

Cost-Complexity Pruning

Balances tree complexity and error rate. Uses parameter α to penalize size. Finds subtree minimizing cost:

R_α(T) = R(T) + α |T|, where R(T) is the tree's error on the training data and |T| is its number of leaf nodes.
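
scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter and the cost_complexity_pruning_path helper; the sketch below selects α on a hold-out split of a demo dataset (cross-validation would be more robust).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values along the cost-complexity path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score >= best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)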

Pruning Algorithms

Bottom-up approach: prune leaves and evaluate error. Stop when pruning increases error. Common in CART and C4.5.

Handling Overfitting

Definition

Overfitting: model captures noise instead of underlying pattern. Result: poor generalization.

Causes

Excessive tree depth. Insufficient training data. High feature dimensionality.

Mitigation Techniques

Pruning (pre/post). Limiting maximum depth. Minimum samples per leaf. Cross-validation for hyperparameter tuning.

Ensemble Methods

Random Forests: aggregate multiple trees trained on bootstrap samples with feature randomness. Gradient Boosting: sequential tree correction on residuals.
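
A minimal comparison of the two ensemble approaches using scikit-learn; the dataset, estimator counts, and depth are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging with feature randomness: each tree sees a bootstrap sample and random feature subsets
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: each shallow tree is fit sequentially to correct the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)

print(cross_val_score(forest, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())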

Regularization

Penalize tree complexity in cost function. Use validation to select optimal parameters.

Feature Selection

Role in Decision Trees

Identify most informative features for splits. Reduces dimensionality. Improves interpretability and performance.

Importance Measures

Gini importance: total decrease in Gini impurity from all splits that use the feature. Information gain importance: sum of information gain over the splits that use the feature.
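
In scikit-learn, the fitted attribute feature_importances_ reports impurity-based (Gini) importance; the example below is a minimal sketch on the iris demo dataset.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Normalized total impurity decrease attributable to each feature
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")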

Handling Continuous Features

Tests: threshold splits (≤ value). Selection by maximizing splitting criterion.

Handling Categorical Features

Multiway splits or binary splits after grouping categories.
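
scikit-learn's tree implementations expect numeric inputs, so categorical features are commonly one-hot encoded into binary columns before fitting; this sketch uses a made-up toy column.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: one categorical feature with three categories (illustrative values)
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"],
                   "label": [1, 0, 0, 0, 1]})

# One-hot encoding turns each category into its own binary column,
# so a binary tree can emulate multiway behaviour with successive splits
X = pd.get_dummies(df[["color"]])
clf = DecisionTreeClassifier().fit(X, df["label"])
print(X.columns.tolist())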

Dimensionality Reduction

Irrelevant features cause noisy splits. Tree-based feature selection prunes such features naturally.

Types of Decision Trees

Classification Trees

Predict categorical labels. Use impurity measures like entropy or Gini index. Output: class or probability distribution.

Regression Trees

Predict continuous values. Use MSE or variance reduction. Output: numeric value (mean of target in leaf).
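
A minimal regression-tree sketch with scikit-learn's DecisionTreeRegressor; the diabetes demo dataset and the depth limit are illustrative choices.

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each leaf predicts the mean target value of the training samples that reach it
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))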

Multiway Trees

Splits can have more than two branches. Suitable for categorical features with multiple values.

Binary Trees

Each split produces two branches. Simplifies computation and interpretation.

Oblique Trees

Splits based on linear combinations of features instead of single feature thresholds. More complex but sometimes more accurate.

Advantages and Limitations

Advantages

Interpretability: easy to visualize and explain. Non-parametric: no assumptions about data distribution. Handles mixed data types. Fast prediction and training. Feature importance extraction.

Limitations

Overfitting tendency on noisy data. Instability: small data changes cause different trees. Bias towards features with many levels. Suboptimal splits due to greedy approach.

Comparison to Other Models

Compared to linear models: better at capturing nonlinear relationships. Compared to SVMs and neural networks: often less accurate but more interpretable.

Computational Complexity

Training: O(n m log n) for a roughly balanced tree (n = samples, m = features). Prediction: O(depth), typically O(log n) for a balanced tree.

Scalability

Efficient on medium datasets. Large datasets require optimization or ensemble methods.

Applications

Classification Tasks

Medical diagnosis, credit scoring, spam detection, customer segmentation.

Regression Tasks

House price prediction, stock price forecasting, demand estimation.

Feature Selection and Data Exploration

Identify important variables in high-dimensional datasets.

Anomaly Detection

Identify unusual observations as those that deviate from the decision boundaries learned on normal data.

Ensemble Learning

Base learners in Random Forests, Gradient Boosting Machines.

Implementation Details

Data Preparation

Handle missing values: imputation or surrogate splits. Encode categorical variables if needed.
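
One common preparation pattern, sketched below, is to impute missing values before fitting; libraries without surrogate splits (unlike CART-style implementations such as rpart) typically rely on this. The data values and pipeline choices are illustrative.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy numeric data with missing entries (illustrative values)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])

# Median imputation followed by the tree; surrogate splits are an
# alternative that avoids imputation entirely where supported
model = make_pipeline(SimpleImputer(strategy="median"),
                      DecisionTreeClassifier(max_depth=2))
model.fit(X, y)
print(model.predict([[5.0, np.nan]]))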

Algorithmic Steps

Calculate splitting criteria for all features. Choose best split. Partition data. Recursively build subtrees.

Handling Missing Values

Use surrogate splits or assign to most probable branch. Avoid data loss.

Parameter Tuning

Control max depth, min samples per leaf, min impurity decrease. Use cross-validation.
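
A grid-search sketch over the complexity parameters mentioned above, using scikit-learn's GridSearchCV; the grid values and dataset are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid over the usual complexity controls
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "min_impurity_decrease": [0.0, 0.001, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)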

Software Libraries

Common libraries: scikit-learn (Python), rpart (R), Weka (Java). Most offer options for customization and tree visualization.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

Evaluation Metrics

Classification Metrics

Accuracy, precision, recall, F1-score, ROC-AUC. Evaluate correctness of predicted labels.
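
A short sketch computing several of these metrics with scikit-learn; the dataset, depth limit, and split are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))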

Regression Metrics

Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R² score.

Cross-Validation

K-fold, stratified sampling to estimate performance and prevent overfitting bias.
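
A minimal stratified k-fold sketch with scikit-learn; the fold count and tree depth are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep class proportions similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())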

Confusion Matrix

Visualizes true positives, false positives, false negatives, true negatives. Diagnostic tool.

Feature Importance Evaluation

Assess contribution of each feature to predictive power. Guides feature engineering.

Metric | Type | Purpose
Accuracy | Classification | Overall correctness
F1-Score | Classification | Balance of precision and recall
MSE, RMSE | Regression | Average squared error
R² Score | Regression | Proportion of variance explained

References

  • Quinlan, J.R., “Induction of Decision Trees,” Machine Learning, vol. 1, no. 1, 1986, pp. 81-106.
  • Breiman, L., Friedman, J., Olshen, R., Stone, C., “Classification and Regression Trees,” Wadsworth, 1984.
  • Quinlan, J.R., “C4.5: Programs for Machine Learning,” Morgan Kaufmann, 1993.
  • Mitchell, T.M., “Machine Learning,” McGraw-Hill, 1997.
  • Hastie, T., Tibshirani, R., Friedman, J., “The Elements of Statistical Learning,” Springer, 2009.