Definition and Purpose

Regularization Concept

Regularization: technique to reduce model overfitting by adding penalty terms to the loss function. Purpose: enforce simplicity, limit model complexity, enhance generalization to unseen data.

Optimization Context

Applied during training optimization. Modifies objective function with penalty proportional to parameter magnitude or structure. Balances data fitting and model complexity.

Scope and Usage

Used in regression, classification, neural networks, support vector machines, and other ML algorithms. Integral in high-dimensional and noisy datasets.

Overfitting and Underfitting

Overfitting

Occurs when model fits noise or random fluctuations in training data. High variance, low bias. Poor generalization on test data.

Underfitting

Model too simple to capture underlying pattern. High bias, low variance. Poor training and test performance.

Role of Regularization

Regularization reduces overfitting by penalizing complex models. Controls variance, improves test accuracy without sacrificing training fit excessively.

Types of Regularization

L1 Regularization

Also called Lasso. Adds sum of absolute values of parameters to loss. Encourages sparsity and feature selection.

L2 Regularization

Also called Ridge. Adds sum of squared parameters to loss. Encourages small weights, smoothness, distributes influence.

Elastic Net

Combines L1 and L2 penalties. Balances sparsity and weight shrinkage. Useful for correlated features.

Other Forms

Dropout, early stopping, data augmentation, Bayesian priors also considered regularization methods.

L1 Regularization (Lasso)

Mathematical Formulation

Objective: minimize loss + λ * ||w||₁. Where ||w||₁ = Σ|wᵢ|, λ ≥ 0 is regularization parameter.

min_w L(w) + λ ∑ |w_i|

Sparsity Induction

L1 penalty induces zeros in weights. Feature selection effect. Simplifies model, improves interpretability.

Optimization Challenges

Non-differentiable at zero. Requires specialized solvers like coordinate descent, subgradient methods, or proximal algorithms.
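The non-differentiability at zero is handled cleanly by the proximal operator of the L1 norm, known as soft-thresholding, which is the core step in proximal algorithms for Lasso. A minimal NumPy sketch; the function name soft_threshold is illustrative:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink each coordinate toward
    zero by t, setting it exactly to zero when |w_i| <= t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([3.0, -0.5, 0.2, -2.0])
print(soft_threshold(w, 1.0))  # coordinates with |w_i| <= 1 become exactly 0
```

The exact zeros produced by this operator are what give L1-regularized solutions their sparsity.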

L2 Regularization (Ridge)

Mathematical Formulation

Objective: minimize loss + λ * ||w||₂². Where ||w||₂² = Σwᵢ², λ ≥ 0 is regularization parameter.

min_w L(w) + λ ∑ w_i^2

Shrinkage Effect

L2 penalty shrinks weights towards zero. Does not enforce sparsity. Reduces model complexity smoothly.

Optimization Properties

Convex and differentiable. Gradient-based methods apply directly. Efficient for large-scale problems.

Elastic Net Regularization

Definition

Weighted combination of L1 and L2 penalties. Balances sparsity and weight decay.

Mathematical Formulation

min_w L(w) + λ₁ ∑ |w_i| + λ₂ ∑ w_i^2

Advantages

Handles correlated features better than Lasso alone. Flexible control via λ₁, λ₂ parameters.
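The elastic net penalty has a closed-form proximal operator that makes both effects visible: soft-thresholding (the L1 part) followed by uniform shrinkage (the L2 part). A minimal NumPy sketch; elastic_net_prox is an illustrative name:

```python
import numpy as np

def elastic_net_prox(w, lam1, lam2):
    """Proximal operator of lam1 * ||w||_1 + lam2 * ||w||_2^2:
    soft-threshold by lam1, then scale by 1 / (1 + 2 * lam2).
    Combines L1 sparsity with L2-style shrinkage."""
    st = np.sign(w) * np.maximum(np.abs(w) - lam1, 0.0)
    return st / (1.0 + 2.0 * lam2)

w = np.array([3.0, 0.4, -2.0])
print(elastic_net_prox(w, 0.5, 0.5))  # zeroes small entries, scales the rest
```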

Regularization Terms and Penalties

Penalty Functions

Penalty terms quantify model complexity. Commonly L1 norm, L2 norm, or combination. Others: group lasso, nuclear norm.

Effect on Parameter Space

Constrains feasible solution set. Encourages smoothness or sparsity. Impacts convergence and stability.

Regularization Parameters

λ controls penalty strength. Larger λ: stronger regularization, simpler model, higher bias. Smaller λ: weaker regularization, complex model, higher variance.
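The effect of λ can be seen directly by sweeping it on a ridge fit: the weight norm shrinks and training error rises as λ grows. A sketch assuming squared-error loss and the closed-form ridge solution (names illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: minimizes ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=40)

results = []
for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    results.append((lam, float(np.linalg.norm(w)),
                    float(np.mean((X @ w - y) ** 2))))
    print(results[-1])  # (lam, weight norm, training MSE)
# stronger regularization: smaller weights, higher training error (more bias)
```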

Penalty Type    Formula                       Effect
L1 (Lasso)      ∑ |wᵢ|                        Sparsity, feature selection
L2 (Ridge)      ∑ wᵢ²                         Weight shrinkage, smoothness
Elastic Net     α ∑ |wᵢ| + (1 − α) ∑ wᵢ²      Sparsity + shrinkage

Bias-Variance Tradeoff

Bias

Error from erroneous assumptions in model. High bias: underfitting.

Variance

Error from sensitivity to data fluctuations. High variance: overfitting.

Regularization Impact

Regularization increases bias slightly while reducing variance substantially. With a well-tuned penalty, the tradeoff moves toward the point of minimal generalization error.

Implementation in Optimization

Regularized Objective Function

General form: J(w) = Loss(w) + λ * Penalty(w). Optimization solves for w minimizing J(w).

Gradient Computation

Derivatives include the penalty gradient: for the L2 penalty λ ∑ wᵢ², the gradient is 2λw. The L1 penalty is non-differentiable at zero, so a subgradient (or a proximal step) is used instead.

Algorithm Adaptations

Standard optimizers modified to include penalty gradient. Proximal gradient methods for L1. Coordinate descent common for sparse solutions.

Gradient Descent Step: w := w − η (∇Loss(w) + λ ∇Penalty(w))
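The step above can be sketched directly in NumPy for a ridge-regularized least-squares problem; gd_ridge and the step size are illustrative choices:

```python
import numpy as np

def gd_ridge(X, y, lam, eta=0.01, steps=1000):
    """Gradient descent on J(w) = ||Xw - y||^2 / n + lam * ||w||^2.
    Each step applies w := w - eta * (grad_loss + lam * grad_penalty)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad_loss = 2.0 / n * X.T @ (X @ w - y)
        grad_penalty = 2.0 * w          # gradient of ||w||^2
        w -= eta * (grad_loss + lam * grad_penalty)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5])   # noiseless targets
w = gd_ridge(X, y, lam=0.01)
print(w)  # close to the true coefficients, slightly shrunk by the penalty
```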

Effect on Different Models

Linear Models

Regularization controls coefficient magnitude. L1 induces sparsity, L2 shrinks coefficients. Improves stability and interpretability.

Neural Networks

Regularization applied to weights reduces overfitting. Methods include weight decay (L2), dropout, batch normalization.
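Weight decay is the L2 penalty seen from the update rule: the penalty gradient 2λw turns w := w − η(g + 2λw) into w := (1 − 2ηλ)w − ηg, a multiplicative decay of the weights each step. A minimal arithmetic sketch (values illustrative):

```python
import numpy as np

eta, lam = 0.1, 0.5
w = np.array([1.0, -2.0])
g = np.zeros(2)  # zero loss gradient isolates the decay effect
w_new = w - eta * (g + 2 * lam * w)
print(w_new)  # [0.9, -1.8]: each weight scaled by (1 - 2*eta*lam) = 0.9
```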

Support Vector Machines (SVM)

Regularization parameter C controls the tradeoff between margin width and misclassification penalty; larger C means weaker regularization. The soft-margin primal objective contains an explicit L2 penalty on the weight vector.

Hyperparameter Tuning

Regularization Strength (λ)

Critical hyperparameter. Controls penalty magnitude. Usually selected via cross-validation or grid search.

Parameter Selection Methods

Grid search, random search, Bayesian optimization, gradient-based tuning. Evaluate on validation set for best generalization.
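A plain grid search over λ with a held-out validation split can be sketched in a few lines, assuming squared-error loss and the closed-form ridge solution (ridge_fit and the grid values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: minimizes ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 1.5]
y = X @ w_true + 0.5 * rng.normal(size=100)

X_tr, y_tr = X[:70], y[:70]   # training split
X_va, y_va = X[70:], y[70:]   # held-out validation split

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {lam: float(np.mean((X_va @ ridge_fit(X_tr, y_tr, lam) - y_va) ** 2))
           for lam in grid}
best = min(val_mse, key=val_mse.get)   # lam with lowest validation error
print(best, val_mse[best])
```

Cross-validation replaces the single split with an average over several folds; the selection logic is the same.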

Tradeoffs

Over-regularization: underfitting. Under-regularization: overfitting. Proper tuning essential for balanced model.

Limitations and Challenges

Model Bias Increase

Regularization introduces bias. Excessive penalty reduces model flexibility.

Computational Complexity

L1 regularization non-smoothness complicates optimization. Requires specialized algorithms.

Feature Correlation

L1 may arbitrarily select one feature among correlated group. Elastic Net mitigates this issue.
