Introduction

Supervised learning is the most common machine learning paradigm: learn a predictive function from labeled training data. Given examples of inputs paired with desired outputs, the learner infers a function mapping new inputs to correct outputs.

Email spam detection: model learns from thousands of labeled emails (spam/not-spam) to classify new emails. Credit risk assessment: learns from loan histories (defaulted/repaid) to predict default probability. Image classification: learns from labeled images to recognize new images. The common thread: learning from labeled examples.

Supervised learning enables remarkable capabilities: GPT models trained on internet text predict next words; AlphaGo trained on game records learned to play Go. But it also has fundamental limitations: it requires large labeled datasets (expensive to create), it cannot discover patterns beyond what the labels encode, and sensitive labeled data raises privacy concerns.

"Supervised learning is learning by example: show the algorithm enough correct examples, and it learns to generalize to new cases. The quality of examples determines model quality." -- Andrew Ng, Stanford AI Lab

Problem Formulation

Mathematical Framework

Training data: D = {(x_i, y_i)}_{i=1}^n
 x_i: input feature vector (predictors)
 y_i: output label (response variable)

Goal: Learn function f: X -> Y such that f(x) ≈ y for unseen data

Loss function L(y_true, y_pred): measures prediction error
Objective: minimize expected loss on unseen data

Two Main Categories

Regression: y is continuous (real number). Example: house price prediction, temperature forecasting.

Classification: y is categorical (discrete classes). Example: spam detection, disease diagnosis.

Generalization Gap

Training error (error on training data): reflects how well model fits known examples. Test error (error on unseen data): reflects true generalization ability. Model is useful only if test error is low; training error alone is misleading.

Ideal: low training error AND low test error
Overfitting: low training error but high test error
Underfitting: high training error and high test error

Regression: Predicting Continuous Values

Linear Regression

Simplest approach: fit straight line (or hyperplane) through data.

y = w*x + b
 w: weight (slope)
 b: bias (intercept)

Loss (Mean Squared Error): L = 1/n * sum((y_pred - y_true)^2)
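A minimal pure-Python sketch of this fit for a single feature, using the closed-form least-squares solution w = cov(x, y) / var(x), b = mean(y) - w * mean(x); the function names are illustrative:

```python
def fit_line(xs, ys):
    # Closed-form least squares for y = w*x + b with one feature.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    b = my - w * mx
    return w, b

def mse_line(xs, ys, w, b):
    # Mean squared error of the fitted line on (xs, ys).
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

On perfectly linear data such as xs = [0, 1, 2, 3], ys = [1, 3, 5, 7], this recovers w = 2, b = 1 with zero MSE.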

Multiple Linear Regression

y = w_1*x_1 + w_2*x_2 + ... + w_d*x_d + b

Closed-form solution (normal equations):
w = (X^T X)^{-1} X^T y
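A NumPy sketch of the normal equations; a column of ones is appended to X so the bias b is learned as an extra weight, and np.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

def fit_ols(X, y):
    # Append a ones column so the last learned weight is the bias b.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (Xb^T Xb) w = Xb^T y rather than forming the inverse.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
```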

Polynomial Regression

Fit polynomial function instead of line. Enables non-linear relationships but risks overfitting (too-flexible model memorizes data).

Regularized Regression

Add penalty term to loss penalizing large weights (prevents overfitting):

Ridge (L2): L = MSE + lambda * ||w||_2^2
Lasso (L1): L = MSE + lambda * ||w||_1

Ridge shrinks weights; Lasso can zero out weights (feature selection)
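Ridge also has a closed form: the L2 penalty adds lambda to the diagonal of X^T X, which shrinks the solution toward zero. A NumPy sketch (bias omitted for brevity; fit_ridge is an illustrative name):

```python
import numpy as np

def fit_ridge(X, y, lam):
    d = X.shape[1]
    # The penalty lambda * ||w||_2^2 adds lam * I to the normal equations.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With lam = 0 this reduces to ordinary least squares; increasing lam shrinks every weight.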

Evaluation

Mean Squared Error (MSE): average squared error
Mean Absolute Error (MAE): average absolute error
R-squared: fraction of variance explained (1 is a perfect fit; can be negative when the model is worse than predicting the mean)

Classification: Predicting Categories

Logistic Regression

Despite name, classification algorithm. Predicts probability of positive class:

P(y=1|x) = sigmoid(w*x + b) = 1 / (1 + e^(-(w*x+b)))

Output: probability in [0, 1]
Decision: y_pred = 1 if P > 0.5, else 0
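The decision rule above in a few lines of pure Python (predict_label is an illustrative helper, not a library call):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, w, b, threshold=0.5):
    # Probability of the positive class, then the thresholded decision.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return p, (1 if p > threshold else 0)
```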

Decision Trees

Recursively partition feature space using yes/no questions. Easy to interpret ("if age > 30 and income > $50k then approve loan").

Advantages: Interpretable, handles non-linear relationships, handles categorical features naturally.

Disadvantages: Prone to overfitting, unstable (small changes in the data can produce a very different tree).

Random Forests

Ensemble of decision trees. Each tree trained on random subset of data/features. Predictions averaged across trees. Reduces overfitting, improves robustness.

Support Vector Machines (SVM)

Find hyperplane maximizing margin between classes. Effective in high dimensions. Kernel trick extends to non-linear boundaries without explicit feature transformation.

Neural Networks

Composition of non-linear functions (layers). Universal approximators: can learn arbitrarily complex decision boundaries given sufficient hidden units and training data.

Evaluation Metrics

Metric     Definition                          Use Case
Accuracy   % correct predictions               Balanced classes
Precision  TP / (TP + FP)                      Cost of false positives high
Recall     TP / (TP + FN)                      Cost of false negatives high
F1 Score   Harmonic mean of precision/recall   Balanced importance
AUC-ROC    Area under ROC curve                Threshold-independent, imbalanced data
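The first four metrics follow directly from confusion-matrix counts; a small sketch:

```python
def classification_metrics(tp, fp, fn, tn):
    # Compute accuracy, precision, recall, F1 from confusion counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, tp=8, fp=2, fn=2, tn=88 gives accuracy 0.96 but precision, recall, and F1 of only 0.8, illustrating why accuracy alone misleads on imbalanced data.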

Loss Functions and Objective Functions

Regression Loss Functions

Mean Squared Error (MSE): L = 1/n * sum((y_pred - y_true)^2). Standard, emphasizes large errors.

Mean Absolute Error (MAE): L = 1/n * sum(|y_pred - y_true|). Robust to outliers (less emphasis on large errors).

Huber Loss: Combination: squared for small errors, linear for large errors. Balances MSE (smooth) and MAE (robustness).
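A sketch of the per-example Huber loss, where delta marks the crossover between the quadratic and linear regimes:

```python
def huber(error, delta=1.0):
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a              # quadratic near zero, like MSE
    return delta * (a - 0.5 * delta)    # linear in the tails, like MAE
```

The two branches agree at |error| = delta, so the loss is continuous and smooth at the crossover.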

Classification Loss Functions

Cross-Entropy Loss: L = -sum(y_true * log(y_pred)). Standard for classification. Penalizes confident wrong predictions heavily.

Hinge Loss (SVM): L = max(0, 1 - y_true * y_pred), with y_true in {-1, +1} and y_pred the raw score. Margin-based: correct predictions with margin at least 1 incur zero loss; incorrect or weak-margin predictions incur loss growing linearly.

Focal Loss: Weighted cross-entropy focusing on hard examples. Addresses class imbalance by down-weighting easy examples.

Regularization Terms

L1 Penalty: sum(|w|). Encourages sparse solutions (many w = 0).

L2 Penalty: sum(w^2). Encourages small weights (distributed importance).

Elastic Net: Combination of L1 and L2. Balances sparsity and smoothness.

Training Process and Optimization

Gradient Descent

Iteratively update weights in direction of negative gradient (downhill on loss landscape):

w_new = w_old - learning_rate * grad(L)

Variants

  • Batch Gradient Descent: Use all data to compute gradient. Slow but stable.
  • Stochastic Gradient Descent (SGD): Use single example. Fast but noisy.
  • Mini-batch SGD: Use batch of ~32-256 examples. Balance: fast and stable.
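The update rule and the mini-batch variant together, as a pure-Python sketch for simple linear regression under MSE (learning rate, epoch count, and batch size are illustrative values):

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=200, batch=2, seed=0):
    # Mini-batch SGD for y = w*x + b with mean-squared-error loss.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for s in range(0, len(idx), batch):
            chunk = idx[s:s + batch]
            # Gradient of the MSE averaged over the mini-batch.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in chunk) / len(chunk)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in chunk) / len(chunk)
            # w_new = w_old - learning_rate * grad(L)
            w -= lr * gw
            b -= lr * gb
    return w, b
```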

Adaptive Methods

  • Momentum: Accumulate gradient over time (acceleration). Helps escape shallow local minima.
  • Adam: Adaptive learning rate per parameter. Combines advantages of momentum and RMSprop. Standard today.

Learning Rate

Critical hyperparameter. Too high: divergence. Too low: slow convergence. Often decayed during training: start high (fast learning), end low (fine-tuning).

Regularization and Overfitting Prevention

L1/L2 Regularization

Add penalty term to loss function penalizing complexity (large weights). Prevents overfitting by trading some training accuracy for generalization.

Dropout

Randomly zero out activations during training (neural networks). Prevents co-adaptation, effectively ensemble averaging at test time. One of most effective regularization techniques.

Early Stopping

Monitor validation loss during training. Stop when validation loss stops improving. Prevents overfitting without modifying loss function.

Data Augmentation

Artificially expand training data via transformations (rotations, crops, color shifts for images). Increases apparent dataset size, reduces overfitting.

Cross-Validation

Divide data into K folds. Train K models, each on K-1 folds, test on the held-out fold. Averaging the K held-out errors gives a more reliable estimate of generalization performance than a single split.
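A minimal sketch of generating the K train/test index splits (round-robin fold assignment; kfold_indices is an illustrative name):

```python
def kfold_indices(n, k):
    # Assign indices 0..n-1 to k folds round-robin, then yield
    # (train, test) index lists with each fold held out once.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```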

Model Evaluation and Validation

Train-Validation-Test Split

Training set (60-70%): learn model parameters
Validation set (10-15%): tune hyperparameters, monitor overfitting
Test set (15-20%): final unbiased estimate of performance

Never tune on test set (overfits to test data)
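A sketch of the split on shuffled indices, using fractions in the ranges above (seed fixed for reproducibility):

```python
import random

def train_val_test_split(n, val_frac=0.15, test_frac=0.15, seed=0):
    # Shuffle indices once, then slice into test / validation / training.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test
```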

Hyperparameter Tuning

Grid Search: Try all combinations of hyperparameter values. Expensive but thorough.

Random Search: Randomly sample hyperparameter space. More efficient than grid search.

Bayesian Optimization: Model relationship between hyperparameters and performance; intelligently select promising hyperparameters. Most efficient.

Learning Curves

Plot training and validation error vs. training set size. Diagnoses underfitting (both errors high and close together) versus overfitting (persistent gap between training and validation error), i.e., where the model sits on the bias/variance tradeoff.

Standard Supervised Learning Algorithms

Algorithm                   Type                       Pros                          Cons
Linear/Logistic Regression  Regression/Classification  Fast, interpretable           Limited to linear patterns
Decision Trees              Both                       Interpretable, non-linear     Overfitting, unstable
Random Forest               Both                       Strong, robust, fast          Less interpretable
SVM                         Both                       High-dimensional, non-linear  Slow training, hyperparameter-sensitive
Neural Networks             Both                       Very powerful, flexible       Black box, needs large data, slow

Feature Engineering and Selection

Feature Engineering

Create new features from raw data. Domain knowledge guides creation: transformations, combinations, domain-specific features.

Raw: date (2023-03-15)
Engineered: day_of_week, month, is_weekend, days_since_epoch

Raw: text document
Engineered: word counts, TF-IDF, word embeddings
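The date example above, sketched with the standard library:

```python
from datetime import date

def date_features(d):
    # Engineered features derived from a single raw date.
    return {
        "day_of_week": d.weekday(),                       # 0 = Monday
        "month": d.month,
        "is_weekend": d.weekday() >= 5,
        "days_since_epoch": (d - date(1970, 1, 1)).days,
    }
```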

Feature Selection

Choose subset of features for model. Reduces dimensionality, removes noise, improves interpretability, speeds up training.

Filter Methods: Rank features by statistical relationship to target (independent of model).

Wrapper Methods: Train models, measure performance with different feature subsets. Computationally expensive but effective.

Embedded Methods: Feature selection integrated into model training (e.g., Lasso coefficients).

Dimensionality Reduction

Reduce number of features via transformations (PCA, t-SNE, autoencoders). PCA keeps the directions of highest variance; t-SNE instead preserves local neighborhood structure (mainly used for visualization).

Practical Considerations

Class Imbalance

In many classification tasks, classes imbalanced (e.g., 99% negative, 1% positive). Standard accuracy misleading. Solutions: stratified sampling, class weights, oversampling minority, undersampling majority, synthetic data generation (SMOTE).

Missing Data

Real data often incomplete. Options: delete incomplete samples (loses data), imputation (mean, median, learned), treat as separate category. Choice depends on missingness pattern and amount.

Scaling and Normalization

Many algorithms (SVM, neural networks, gradient descent) sensitive to feature scale. Normalize features to similar ranges (0-1, or zero-mean unit-variance). Use training set statistics for validation/test.
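Zero-mean, unit-variance scaling for one feature, with statistics taken from the training set only and reused for held-out data (a minimal sketch):

```python
def standardize(train, held_out):
    # Mean and standard deviation come from the training data only;
    # the same statistics are applied to validation/test data.
    n = len(train)
    mean = sum(train) / n
    std = (sum((x - mean) ** 2 for x in train) / n) ** 0.5 or 1.0
    scale = lambda xs: [(x - mean) / std for x in xs]
    return scale(train), scale(held_out)
```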

Computational Resources

Large datasets: neural networks require GPUs for reasonable training time. Tree methods parallelizable but memory-intensive for large trees. Linear methods most efficient.

Reproducibility

Set random seeds, record hyperparameters, document data preprocessing. Essential for debugging, reproducible research, production systems.

References

  • Bishop, C. M. "Pattern Recognition and Machine Learning." Springer, 2006.
  • Goodfellow, I., Bengio, Y., and Courville, A. "Deep Learning." MIT Press, 2016.
  • Hastie, T., Tibshirani, R., and Friedman, J. "The Elements of Statistical Learning." Springer, 2nd edition, 2009.
  • Ng, A., and Jordan, M. "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes." NIPS, 2001.
  • Kingma, D. P., and Ba, J. "Adam: A Method for Stochastic Optimization." ICLR, 2015.