Definition and Purpose
Regularization Concept
Regularization: technique to reduce model overfitting by adding penalty terms to the loss function. Purpose: enforce simplicity, discourage excessive model complexity, improve generalization to unseen data.
Optimization Context
Applied during training optimization. Modifies objective function with penalty proportional to parameter magnitude or structure. Balances data fitting and model complexity.
Scope and Usage
Used in regression, classification, neural networks, support vector machines, and other ML algorithms. Integral in high-dimensional and noisy datasets.
Overfitting and Underfitting
Overfitting
Occurs when model fits noise or random fluctuations in training data. High variance, low bias. Poor generalization on test data.
Underfitting
Model too simple to capture underlying pattern. High bias, low variance. Poor training and test performance.
Role of Regularization
Regularization reduces overfitting by penalizing complex models. Controls variance, improves test accuracy without sacrificing training fit excessively.
Types of Regularization
L1 Regularization
Also called Lasso. Adds sum of absolute values of parameters to loss. Encourages sparsity and feature selection.
L2 Regularization
Also called Ridge. Adds sum of squared parameters to loss. Encourages small weights, smoothness, distributes influence.
Elastic Net
Combines L1 and L2 penalties. Balances sparsity and weight shrinkage. Useful for correlated features.
Other Forms
Dropout, early stopping, data augmentation, Bayesian priors also considered regularization methods.
L1 Regularization (Lasso)
Mathematical Formulation
Objective: minimize loss + λ·||w||₁, where ||w||₁ = Σ|wᵢ| and λ ≥ 0 is the regularization parameter.
min_w L(w) + λ ∑ |w_i|
Sparsity Induction
L1 penalty induces zeros in weights. Feature selection effect. Simplifies model, improves interpretability.
Optimization Challenges
Non-differentiable at zero. Requires specialized solvers like coordinate descent, subgradient methods, or proximal algorithms.
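A minimal sketch of the sparsity effect, using scikit-learn's Lasso solver (coordinate descent under the hood; assumes scikit-learn is installed, and the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features carry signal; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1)  # alpha plays the role of λ
model.fit(X, y)

# The L1 penalty drives most coefficients exactly to zero — the
# feature-selection effect described above.
print(model.coef_)
print("nonzero:", np.count_nonzero(model.coef_))
```

With a larger `alpha`, even the informative coefficients shrink further and may eventually vanish.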
L2 Regularization (Ridge)
Mathematical Formulation
Objective: minimize loss + λ·||w||₂², where ||w||₂² = Σwᵢ² and λ ≥ 0 is the regularization parameter.
min_w L(w) + λ ∑ w_i^2
Shrinkage Effect
L2 penalty shrinks weights towards zero. Does not enforce sparsity. Reduces model complexity smoothly.
Optimization Properties
Convex and differentiable. Gradient-based methods apply directly. Efficient for large-scale problems.
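Because the ridge objective is convex and differentiable, it even admits a closed-form solution w = (XᵀX + λI)⁻¹Xᵀy for squared-error loss. A minimal NumPy-only sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    """Closed-form ridge solution w = (XᵀX + λI)⁻¹ Xᵀy."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)

# Larger λ shrinks every coefficient toward zero, but none become exactly
# zero — shrinkage without sparsity.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```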
Elastic Net Regularization
Definition
Weighted combination of L1 and L2 penalties. Balances sparsity and weight decay.
Mathematical Formulation
min_w L(w) + λ₁ ∑ |w_i| + λ₂ ∑ w_i^2
Advantages
Handles correlated features better than Lasso alone. Flexible control via λ₁, λ₂ parameters.
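A sketch of the correlated-features behavior using scikit-learn (assumed installed; its `ElasticNet` parameterizes the two penalties as a single strength `alpha` plus a mixing ratio `l1_ratio` rather than separate λ₁, λ₂):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Two nearly identical (highly correlated) features, plus three noise columns.
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 3))])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# l1_ratio mixes the penalties: 1.0 is pure L1, 0.0 is pure L2.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# The L2 component encourages Elastic Net to share weight across the
# correlated pair, whereas Lasso may zero one of them out arbitrarily.
print("enet :", enet.coef_[:2])
print("lasso:", lasso.coef_[:2])
```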
Regularization Terms and Penalties
Penalty Functions
Penalty terms quantify model complexity. Commonly L1 norm, L2 norm, or combination. Others: group lasso, nuclear norm.
Effect on Parameter Space
Constrains feasible solution set. Encourages smoothness or sparsity. Impacts convergence and stability.
Regularization Parameters
λ controls penalty strength. Larger λ: stronger regularization, simpler model, higher bias. Smaller λ: weaker regularization, complex model, higher variance.
| Penalty Type | Formula | Effect |
|---|---|---|
| L1 (Lasso) | ∑ |wᵢ| | Sparsity, feature selection |
| L2 (Ridge) | ∑ wᵢ² | Weight shrinkage, smoothness |
| Elastic Net | α∑ |wᵢ| + (1−α)∑ wᵢ² | Sparsity + shrinkage |
Bias-Variance Tradeoff
Bias
Error from erroneous assumptions in model. High bias: underfitting.
Variance
Error from sensitivity to data fluctuations. High variance: overfitting.
Regularization Impact
Regularization increases bias slightly while reducing variance substantially. Goal: a tradeoff that minimizes generalization error.
Implementation in Optimization
Regularized Objective Function
General form: J(w) = Loss(w) + λ * Penalty(w). Optimization solves for w minimizing J(w).
Gradient Computation
Derivatives include penalty gradient: L2 penalty gradient = 2λw. L1 penalty gradient is subgradient due to non-differentiability.
Algorithm Adaptations
Standard optimizers modified to include penalty gradient. Proximal gradient methods for L1. Coordinate descent common for sparse solutions.
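A minimal sketch of such a modified update — plain gradient descent on a least-squares loss with an L2 penalty, using NumPy only (the η and λ values and the synthetic data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

eta, lam = 0.01, 0.5     # learning rate η and regularization strength λ
w = np.zeros(3)
for _ in range(1000):
    grad_loss = X.T @ (X @ w - y) / len(y)  # ∇Loss(w) for mean squared error
    grad_penalty = 2.0 * w                  # ∇(‖w‖₂²) = 2w
    w -= eta * (grad_loss + lam * grad_penalty)

print(w)  # shrunk toward zero relative to w_true
```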
Gradient Descent Step: w := w - η (∇Loss(w) + λ ∇Penalty(w))
Effect on Different Models
Linear Models
Regularization controls coefficient magnitude. L1 induces sparsity, L2 shrinks coefficients. Improves stability and interpretability.
Neural Networks
Regularization applied to weights reduces overfitting. Methods include weight decay (L2), dropout, batch normalization.
Support Vector Machines (SVM)
Regularization parameter C trades off margin width against misclassification penalty; smaller C means stronger regularization. Equivalent to an L2 penalty on the weights in the primal formulation.
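A small sketch of C's effect using scikit-learn's `LinearSVC` (assumed installed; the two overlapping Gaussian classes are synthetic):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two overlapping 2-D Gaussian classes.
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Small C → heavy regularization → small weight norm (wide margin);
# large C → light regularization → larger weight norm.
svm_strong = LinearSVC(C=0.01, max_iter=10000, random_state=0).fit(X, y)
svm_weak = LinearSVC(C=100.0, max_iter=10000, random_state=0).fit(X, y)

print(np.linalg.norm(svm_strong.coef_), np.linalg.norm(svm_weak.coef_))
```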
Hyperparameter Tuning
Regularization Strength (λ)
Critical hyperparameter. Controls penalty magnitude. Usually selected via cross-validation or grid search.
Parameter Selection Methods
Grid search, random search, Bayesian optimization, gradient-based tuning. Evaluate on validation set for best generalization.
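A sketch of grid search with cross-validation using scikit-learn (assumed installed; note scikit-learn names the regularization strength `alpha`, and the grid values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=100)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,  # 5-fold cross-validation: each alpha is scored on held-out folds
)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
```

The winning `alpha` is the one with the best mean validation score, approximating the best generalization among the candidates.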
Tradeoffs
Over-regularization: underfitting. Under-regularization: overfitting. Proper tuning essential for balanced model.
Limitations and Challenges
Model Bias Increase
Regularization introduces bias. Excessive penalty reduces model flexibility.
Computational Complexity
L1 regularization non-smoothness complicates optimization. Requires specialized algorithms.
Feature Correlation
L1 may arbitrarily select one feature among correlated group. Elastic Net mitigates this issue.
References
- Tibshirani, R. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, 1996, pp. 267-288.
- Hoerl, A.E., Kennard, R.W. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, vol. 12, no. 1, 1970, pp. 55-67.
- Zou, H., Hastie, T. "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, 2005, pp. 301-320.
- Bishop, C.M. "Pattern Recognition and Machine Learning." Springer, 2006.
- Goodfellow, I., Bengio, Y., Courville, A. "Deep Learning." MIT Press, 2016.