Introduction
Supervised learning is the most common machine learning paradigm: learn a predictive function from labeled training data. Given examples of inputs paired with desired outputs, the learner infers a function mapping new inputs to correct outputs.
Email spam detection: model learns from thousands of labeled emails (spam/not-spam) to classify new emails. Credit risk assessment: learns from loan histories (defaulted/repaid) to predict default probability. Image classification: learns from labeled images to recognize new images. The common thread: learning from labeled examples.
Supervised learning enables remarkable capabilities: GPT models learn next-word prediction from internet text (each word serves as the label for the preceding context); AlphaGo was initially trained on records of expert games before refining its play with reinforcement learning. But it also has fundamental limitations: it requires large labeled datasets (expensive to create), it does not discover patterns humans never labeled for, and sensitive labeled data raises privacy concerns.
"Supervised learning is learning by example: show the algorithm enough correct examples, and it learns to generalize to new cases. The quality of examples determines model quality." -- Andrew Ng, Stanford AI Lab
Problem Formulation
Mathematical Framework
Training data: D = {(x_i, y_i)}_{i=1}^n
x_i: input feature vector (predictors)
y_i: output label (response variable)
Goal: Learn function f: X -> Y such that f(x) ≈ y for unseen data
Loss function L(y_true, y_pred): measures prediction error
Objective: minimize expected loss on unseen data
Two Main Categories
Regression: y is continuous (real number). Example: house price prediction, temperature forecasting.
Classification: y is categorical (discrete classes). Example: spam detection, disease diagnosis.
Generalization Gap
Training error (error on training data): reflects how well model fits known examples. Test error (error on unseen data): reflects true generalization ability. Model is useful only if test error is low; training error alone is misleading.
Ideal: low training error AND low test error
Overfitting: low training error but high test error
Underfitting: high training error and high test error
Regression: Predicting Continuous Values
Linear Regression
Simplest approach: fit straight line (or hyperplane) through data.
y = w*x + b
w: weight (slope)
b: bias (intercept)
Loss (Mean Squared Error): L = 1/n * sum((y_pred - y_true)^2)
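As a minimal sketch of the idea, the one-feature least-squares fit can be written in a few lines of plain Python (function names here are illustrative; real projects would use numpy or scikit-learn):

```python
# Simple linear regression for one feature, fit by minimizing MSE.

def fit_line(xs, ys):
    """Return (w, b) minimizing mean squared error of y = w*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Least-squares slope: covariance(x, y) / variance(x)
    w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - w * x_mean
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Noise-free data generated from y = 2x + 1; the fit recovers it exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = fit_line(xs, ys)
```

On noisy data the same code returns the line with the smallest achievable MSE rather than an exact recovery.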
Multiple Linear Regression
y = w_1*x_1 + w_2*x_2 + ... + w_d*x_d + b
Closed-form solution (normal equations):
w = (X^T X)^{-1} X^T y
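For the one-feature model with a bias term, the design matrix has columns [x, 1], so the normal equations reduce to a 2x2 linear system. The sketch below solves it by Cramer's rule using only the stdlib (for anything larger, numpy.linalg.lstsq is the practical choice):

```python
# Normal equations for y = w*x + b: solve (X^T X) [w, b]^T = X^T y,
# where X has columns [x_i, 1]. The 2x2 system is solved by Cramer's rule.

def normal_equations(xs, ys):
    n = len(xs)
    sxx = sum(x * x for x in xs)              # (X^T X)[0][0]
    sx = sum(xs)                              # (X^T X) off-diagonal
    sxy = sum(x * y for x, y in zip(xs, ys))  # (X^T y)[0]
    sy = sum(ys)                              # (X^T y)[1]
    # X^T X = [[sxx, sx], [sx, n]],  X^T y = [sxy, sy]
    det = sxx * n - sx * sx
    w = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [3 * x - 0.5 for x in xs]   # generated from w = 3, b = -0.5
w, b = normal_equations(xs, ys)
```

Note that the closed form requires X^T X to be invertible (det != 0), which fails when features are perfectly collinear.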
Polynomial Regression
Fit polynomial function instead of line. Enables non-linear relationships but risks overfitting (too-flexible model memorizes data).
Regularized Regression
Add penalty term to loss penalizing large weights (prevents overfitting):
Ridge (L2): L = MSE + lambda * ||w||_2^2
Lasso (L1): L = MSE + lambda * ||w||_1
Ridge shrinks weights; Lasso can zero out weights (feature selection)
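The shrinkage behavior is easiest to see in the simplest setting: one feature, no intercept, minimizing sum((w*x - y)^2) plus the penalty. The 1D closed forms below are a hand-derived illustration of that special case (in practice, use sklearn.linear_model.Ridge / Lasso):

```python
# Ridge vs. lasso in 1D: ridge shrinks w toward zero; lasso's
# soft-thresholding can set w exactly to zero.

def ols_1d(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_1d(xs, ys, lam):
    # d/dw [sum((w*x - y)^2) + lam*w^2] = 0  =>  w = Sxy / (Sxx + lam)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def lasso_1d(xs, ys, lam):
    # Minimizer of sum((w*x - y)^2) + lam*|w|: soft-threshold Sxy by lam/2.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(0.0, abs(sxy) - lam / 2)
    return (1 if sxy >= 0 else -1) * shrunk / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                   # true w = 2
w_ols = ols_1d(xs, ys)                 # exactly 2.0
w_ridge = ridge_1d(xs, ys, lam=14.0)   # shrunk, but still nonzero
w_lasso = lasso_1d(xs, ys, lam=100.0)  # large penalty zeroes w out entirely
```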
Evaluation
Mean Squared Error (MSE): average squared error
Mean Absolute Error (MAE): average absolute error
R-squared: fraction of variance explained (1 is perfect; it can go negative on unseen data when the model predicts worse than the mean)
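These three metrics computed from scratch (illustrative; see sklearn.metrics for production implementations):

```python
# Regression evaluation metrics: MSE, MAE, and R-squared.

def mse(y_true, y_pred):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot   # 1 = perfect, 0 = no better than the mean

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
```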
Classification: Predicting Categories
Logistic Regression
Despite name, classification algorithm. Predicts probability of positive class:
P(y=1|x) = sigmoid(w*x + b) = 1 / (1 + e^(-(w*x+b)))
Output: probability in [0, 1]
Decision: y_pred = 1 if P > 0.5, else 0
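The inference step can be sketched directly from the formula; the weights below are hypothetical, standing in for parameters that training would have produced:

```python
# Logistic regression inference: squash the linear score through the
# sigmoid, then threshold the probability at 0.5.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) under the logistic model."""
    return sigmoid(w * x + b)

def predict(x, w, b, threshold=0.5):
    return 1 if predict_proba(x, w, b) > threshold else 0

w, b = 2.0, -1.0               # assumed (already-trained) parameters
p = predict_proba(0.5, w, b)   # linear score = 0 -> probability exactly 0.5
```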
Decision Trees
Recursively partition feature space using yes/no questions. Easy to interpret ("if age > 30 and income > $50k then approve loan").
Advantages: Interpretable, handles non-linear relationships, handles categorical features naturally.
Disadvantages: Prone to overfitting, unstable (small changes in the data can produce a very different tree).
Random Forests
Ensemble of decision trees. Each tree trained on random subset of data/features. Predictions averaged across trees. Reduces overfitting, improves robustness.
Support Vector Machines (SVM)
Find hyperplane maximizing margin between classes. Effective in high dimensions. Kernel trick extends to non-linear boundaries without explicit feature transformation.
Neural Networks
Composition of non-linear functions (layers). Universal approximators: can learn arbitrarily complex decision boundaries given sufficient hidden units and training data.
Evaluation Metrics
| Metric | Definition | Use Case |
|---|---|---|
| Accuracy | % correct predictions | Balanced classes |
| Precision | TP / (TP + FP) | Cost of false positives high |
| Recall | TP / (TP + FN) | Cost of false negatives high |
| F1 Score | Harmonic mean of precision/recall | Balanced importance |
| AUC-ROC | Area under ROC curve | Threshold-independent, imbalanced data |
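The first four metrics in the table follow directly from the confusion-matrix counts (TP, FP, FN, TN); a from-scratch sketch with a deliberately imbalanced example:

```python
# Classification metrics from confusion-matrix counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)     # harmonic mean of precision and recall

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Imbalanced example: 10 positives out of 100; the model finds 6 of them
# (TP=6, FN=4) at the cost of 2 false alarms (FP=2, TN=88).
tp, fp, fn, tn = 6, 2, 4, 88
```

Note how accuracy (0.94) looks flattering here while recall (0.60) exposes the four missed positives.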
Loss Functions and Objective Functions
Regression Loss Functions
Mean Squared Error (MSE): L = 1/n * sum((y_pred - y_true)^2). Standard, emphasizes large errors.
Mean Absolute Error (MAE): L = 1/n * sum(|y_pred - y_true|). Robust to outliers (less emphasis on large errors).
Huber Loss: Combination: squared for small errors, linear for large errors. Balances MSE (smooth) and MAE (robustness).
Classification Loss Functions
Cross-Entropy Loss: L = -sum(y_true * log(y_pred)). Standard for classification. Penalizes confident wrong predictions heavily.
Hinge Loss (SVM): L = max(0, 1 - y_true * y_pred), with labels y_true in {-1, +1} and y_pred the raw (unsquashed) score. Margin-based: correct predictions with sufficient margin incur zero loss; incorrect or weak-margin predictions incur loss growing linearly.
Focal Loss: Weighted cross-entropy focusing on hard examples. Addresses class imbalance by down-weighting easy examples.
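Cross-entropy (in its binary form) and hinge loss from scratch; note the two losses use different label conventions:

```python
# Binary cross-entropy uses y in {0, 1} with a predicted probability;
# hinge loss uses y in {-1, +1} with a raw score.
import math

def binary_cross_entropy(y_true, p_pred):
    """y_true in {0, 1}; p_pred = predicted P(y = 1)."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

def hinge(y_true, score):
    """y_true in {-1, +1}; score is the raw margin w*x + b."""
    return max(0.0, 1.0 - y_true * score)

confident_right = binary_cross_entropy(1, 0.99)  # small loss
confident_wrong = binary_cross_entropy(1, 0.01)  # heavily penalized
```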
Regularization Terms
L1 Penalty: sum(|w|). Encourages sparse solutions (many w = 0).
L2 Penalty: sum(w^2). Encourages small weights (distributed importance).
Elastic Net: Combination of L1 and L2. Balances sparsity and smoothness.
Training Process and Optimization
Gradient Descent
Iteratively update weights in direction of negative gradient (downhill on loss landscape):
w_new = w_old - learning_rate * grad(L)
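The update rule in action on the 1D least-squares objective L(w) = (1/n) * sum((w*x - y)^2), whose gradient is (2/n) * sum((w*x - y) * x). A minimal batch-gradient-descent sketch:

```python
# Batch gradient descent for 1D least squares (no intercept).

def grad(w, xs, ys):
    """Gradient of the MSE loss with respect to w."""
    n = len(xs)
    return (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))

def gradient_descent(xs, ys, lr=0.05, steps=200):
    w = 0.0                        # start from zero
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)  # step downhill on the loss landscape
    return w

xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]               # generated from w = 3
w = gradient_descent(xs, ys)       # converges to w close to 3
```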
Variants
- Batch Gradient Descent: Use all data to compute gradient. Slow but stable.
- Stochastic Gradient Descent (SGD): Use single example. Fast but noisy.
- Mini-batch SGD: Use batch of ~32-256 examples. Balance: fast and stable.
Adaptive Methods
- Momentum: Keep an exponentially decaying average of past gradients (acceleration). Helps traverse flat regions and escape shallow local minima.
- Adam: Adaptive learning rate per parameter. Combines advantages of momentum and RMSprop. Standard today.
Learning Rate
Critical hyperparameter. Too high: divergence. Too low: slow convergence. Often decayed during training: start high (fast learning), end low (fine-tuning).
Regularization and Overfitting Prevention
L1/L2 Regularization
Add penalty term to loss function penalizing complexity (large weights). Prevents overfitting by trading some training accuracy for generalization.
Dropout
Randomly zero out activations during training (neural networks). Prevents co-adaptation, effectively ensemble averaging at test time. One of most effective regularization techniques.
Early Stopping
Monitor validation loss during training. Stop when validation loss stops improving. Prevents overfitting without modifying loss function.
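The stopping rule itself, separated from any particular model; the loss sequence below is made up, standing in for values a real training loop would produce:

```python
# Early stopping: stop after `patience` epochs with no improvement
# in validation loss.

def early_stop_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch          # no improvement for `patience` epochs
    return len(val_losses) - 1    # ran out of epochs without triggering

# Validation loss improves, then starts rising (overfitting sets in at
# epoch 3); with patience=2, training stops at epoch 5.
losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.55]
stop = early_stop_epoch(losses, patience=2)
```

In practice one also restores the weights saved at the best epoch (epoch 3 here), not the weights at the stopping epoch.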
Data Augmentation
Artificially expand training data via transformations (rotations, crops, color shifts for images). Increases apparent dataset size, reduces overfitting.
Cross-Validation
Divide data into K folds. Train K models, each on K-1 folds, test on the held-out fold. Averaging the K held-out errors gives a nearly unbiased estimate of generalization performance.
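The split logic alone (no model) can be sketched in a few lines; this partitions indices into contiguous folds, whereas practical tools such as sklearn.model_selection.KFold also offer shuffling and stratification:

```python
# K-fold split: partition indices 0..n-1 into K folds; each fold
# serves once as the held-out test set.

def kfold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

splits = list(kfold_indices(10, 3))   # folds of sizes 4, 3, 3
```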
Model Evaluation and Validation
Train-Validation-Test Split
Training set (60-70%): learn model parameters
Validation set (10-15%): tune hyperparameters, monitor overfitting
Test set (15-20%): final unbiased estimate of performance
Never tune on test set (overfits to test data)
Hyperparameter Tuning
Grid Search: Try all combinations of hyperparameter values. Expensive but thorough.
Random Search: Randomly sample hyperparameter space. More efficient than grid search.
Bayesian Optimization: Model relationship between hyperparameters and performance; intelligently select promising hyperparameters. Most efficient.
Learning Curves
Plot training and validation error vs. training set size. Diagnoses underfitting (both errors high and converged) versus overfitting (a persistent gap between low training error and higher validation error), guiding where the model sits on the bias/variance tradeoff.
Standard Supervised Learning Algorithms
| Algorithm | Type | Pros | Cons |
|---|---|---|---|
| Linear/Logistic Regression | Regression/Classification | Fast, interpretable | Limited to linear patterns |
| Decision Trees | Both | Interpretable, non-linear | Overfitting, unstable |
| Random Forest | Both | Strong, robust, fast | Less interpretable |
| SVM | Both | High-dimensional, non-linear | Slow training, hyperparameter-sensitive |
| Neural Networks | Both | Very powerful, flexible | Black box, needs large data, slow |
Feature Engineering and Selection
Feature Engineering
Create new features from raw data. Domain knowledge guides creation: transformations, combinations, domain-specific features.
Raw: date (2023-03-15)
Engineered: day_of_week, month, is_weekend, days_since_epoch
Raw: text document
Engineered: word counts, TF-IDF, word embeddings
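The date example above is straightforward to implement with the standard library; the feature names mirror the list above:

```python
# Feature engineering: expand a raw date string into model-ready
# numeric/boolean features.
from datetime import date, datetime

def date_features(raw):
    """Turn 'YYYY-MM-DD' into a dict of engineered features."""
    d = datetime.strptime(raw, "%Y-%m-%d").date()
    return {
        "day_of_week": d.weekday(),           # Monday = 0 ... Sunday = 6
        "month": d.month,
        "is_weekend": d.weekday() >= 5,
        "days_since_epoch": (d - date(1970, 1, 1)).days,
    }

feats = date_features("2023-03-15")   # a Wednesday
```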
Feature Selection
Choose subset of features for model. Reduces dimensionality, removes noise, improves interpretability, speeds up training.
Filter Methods: Rank features by statistical relationship to target (independent of model).
Wrapper Methods: Train models, measure performance with different feature subsets. Computationally expensive but effective.
Embedded Methods: Feature selection integrated into model training (e.g., Lasso coefficients).
Dimensionality Reduction
Reduce number of features via transformations. PCA keeps the directions of greatest variance; t-SNE preserves local neighborhood structure (mainly used for visualization); autoencoders learn compressed representations.
Practical Considerations
Class Imbalance
In many classification tasks, classes imbalanced (e.g., 99% negative, 1% positive). Standard accuracy misleading. Solutions: stratified sampling, class weights, oversampling minority, undersampling majority, synthetic data generation (SMOTE).
Missing Data
Real data often incomplete. Options: delete incomplete samples (loses data), imputation (mean, median, learned), treat as separate category. Choice depends on missingness pattern and amount.
Scaling and Normalization
Many algorithms (SVM, neural networks, gradient descent) sensitive to feature scale. Normalize features to similar ranges (0-1, or zero-mean unit-variance). Use training set statistics for validation/test.
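A stdlib-only sketch of the fit-on-train, apply-everywhere pattern for zero-mean unit-variance standardization:

```python
# Standardize features using statistics computed on the TRAINING data
# only, then apply those same statistics to validation/test data.
import math

def fit_scaler(train):
    """Return (mean, std) computed from the training values."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, math.sqrt(var)

def transform(values, mean, std):
    return [(x - mean) / std for x in values]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0]

mean, std = fit_scaler(train)               # from train only
train_scaled = transform(train, mean, std)  # mean 0, unit variance
test_scaled = transform(test, mean, std)    # reuses train statistics
```

Fitting the scaler on the full dataset instead would leak test-set information into training, a common and subtle bug.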
Computational Resources
Large datasets: neural networks require GPUs for reasonable training time. Tree methods parallelizable but memory-intensive for large trees. Linear methods most efficient.
Reproducibility
Set random seeds, record hyperparameters, document data preprocessing. Essential for debugging, reproducible research, production systems.