Definition and Overview
Concept
Linear regression models the linear relationship between a dependent variable (target) and one or more independent variables (features). Objective: predict continuous outcomes by fitting a linear equation to observed data.
Type
Supervised learning: requires labeled datasets with input-output pairs. Regression task: output is continuous numerical value.
Historical Context
Originated in statistics with the method of least squares (Legendre, 1805; Gauss, 1809). Widely adopted in machine learning for predictive analytics and data modeling.
"All models are wrong, but some are useful." -- George Box
Mathematical Formulation
Model Equation
Linear regression model: y = Xβ + ε, where y is the vector of targets, X the design matrix of features (conventionally including a column of ones for the intercept), β the vector of coefficients, and ε the error term.
Ordinary Least Squares (OLS)
Estimation method minimizing the residual sum of squares (RSS): RSS(β) = ∑(y_i - x_iᵀβ)². Objective: find the β̂ that minimizes RSS.
Closed-form Solution
β̂ = (XᵀX)⁻¹Xᵀy, where Xᵀ is the transpose of X and (XᵀX)⁻¹ the inverse of XᵀX. Valid when XᵀX is invertible, i.e., X has full column rank.
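A minimal NumPy sketch of the closed-form solution on toy data (all names and values are illustrative; np.linalg.lstsq solves the same least-squares problem without forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

# Toy data: 100 samples, intercept column plus 3 features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
beta_true = np.array([2.0, 0.5, -1.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Equivalent to β̂ = (XᵀX)⁻¹Xᵀy, but solved as a least-squares problem
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```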
Core Assumptions
Linearity
Relationship between features and target is linear in parameters.
Independence
Observations are independent of each other; no autocorrelation.
Homoscedasticity
Constant variance of error terms across all levels of independent variables.
Normality of Errors
Error terms are normally distributed with zero mean.
No Multicollinearity
Independent variables are not highly correlated among themselves.
Estimation Methods
Ordinary Least Squares (OLS)
Most common method: minimizes the residual sum of squares; under the Gauss-Markov assumptions, yields the best (minimum-variance) linear unbiased estimator.
Gradient Descent
Iterative optimization: updates β by stepping in the direction opposite the gradient of the loss function. Suitable for large-scale data where the closed-form solution is too costly.
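A hedged sketch of full-batch gradient descent on the MSE loss (step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Minimize MSE by full-batch gradient descent; X should include an intercept column."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ beta - y)  # gradient of the MSE loss
        beta -= lr * grad                        # step opposite the gradient
    return beta
```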
Stochastic Gradient Descent (SGD)
Variant of gradient descent using a single random sample or small mini-batches per iteration. Much cheaper per iteration, so it often converges faster in wall-clock time on big datasets.
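In scikit-learn this corresponds to SGDRegressor with squared-error loss; a sketch assuming predefined training arrays X_train and y_train (hyperparameters are illustrative):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SGD is sensitive to feature scale, so standardize first
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(loss="squared_error", max_iter=1000),
)
sgd.fit(X_train, y_train)  # X_train, y_train assumed to exist
```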
Maximum Likelihood Estimation
Interprets regression as a probabilistic model; maximizes the likelihood of the observed data given the parameters. With Gaussian errors, the MLE for β coincides with the OLS solution.
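A small numerical check of that equivalence (scipy and the toy data are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

def neg_log_likelihood(beta):
    # Gaussian negative log-likelihood with σ fixed and constants dropped:
    # proportional to the residual sum of squares
    resid = y - X @ beta
    return 0.5 * np.sum(resid**2)

mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
# mle and ols agree up to optimizer tolerance
```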
Model Evaluation Metrics
Mean Squared Error (MSE)
Average squared difference between predicted and actual values: sensitive to outliers.
Root Mean Squared Error (RMSE)
Square root of MSE; interpretable in same units as dependent variable.
Mean Absolute Error (MAE)
Average of absolute differences; more robust to outliers than MSE.
R-squared (Coefficient of Determination)
Proportion of variance in the dependent variable explained by the model. Ranges from 0 (no fit) to 1 (perfect fit) for OLS with an intercept; can be negative when a model is evaluated out of sample.
Adjusted R-squared
Adjusts R² for number of predictors, penalizes overfitting.
| Metric | Formula | Interpretation |
|---|---|---|
| MSE | (1/n) ∑(y_i - ŷ_i)² | Lower is better; penalizes large errors |
| R-squared | 1 - (RSS/TSS) | Higher is better; variance explained |
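All of these metrics are available in scikit-learn; a sketch assuming arrays y_test and predictions already exist:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)                              # same units as the target
mae = mean_absolute_error(y_test, predictions)   # robust to outliers
r2 = r2_score(y_test, predictions)               # variance explained
```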
Feature Selection and Importance
Purpose
Identify relevant predictors contributing most to target variable; reduce dimensionality and improve generalization.
Techniques
Forward selection, backward elimination, stepwise regression. Use statistical tests and information criteria (AIC, BIC).
Coefficient Magnitude
Size and sign of β coefficients indicate direction of influence and, when features are on comparable scales, relative importance; raw magnitudes depend on feature scale.
Multicollinearity Impact
High correlation among predictors inflates coefficient variances and distorts apparent importance; diagnose with the Variance Inflation Factor (VIF).
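VIF can be computed with statsmodels; a sketch assuming X is a 2-D array of predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xc = sm.add_constant(np.asarray(X))  # include an intercept before computing VIF
vifs = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]
# Rule of thumb: VIF above roughly 5-10 signals problematic multicollinearity
```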
Regularization Techniques
Purpose
Prevent overfitting, handle multicollinearity, improve model robustness by penalizing large coefficients.
Ridge Regression (L2)
Adds penalty proportional to the square of coefficients: λ∑β_j². Shrinks coefficients toward zero but never exactly to zero.
Lasso Regression (L1)
Penalty proportional to absolute value of coefficients: λ∑|β_j|. Enables sparse models by forcing some coefficients to zero.
Elastic Net
Combines L1 and L2 penalties; balances sparsity and coefficient shrinkage. Combined objective: Loss = RSS + λ₁∑|β_j| + λ₂∑β_j²
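A sketch comparing the three penalized variants in scikit-learn, assuming predefined X_train and y_train (scikit-learn's alpha plays the role of λ, and ElasticNet's l1_ratio balances the L1 and L2 terms; the values shown are illustrative):

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                    # L2: shrinks coefficients
lasso = Lasso(alpha=0.1)                    # L1: zeroes some coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2
for model in (ridge, lasso, enet):
    model.fit(X_train, y_train)
```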
Interpretation of Coefficients
Magnitude
Indicates strength of the feature's effect on the target; larger absolute value means stronger influence when features share a common scale.
Sign
Positive coefficient: direct relationship; negative: inverse relationship.
Units
Coefficient units: expected change in target per one-unit change in the feature, holding other features constant.
Statistical Significance
Assessed via hypothesis tests (t-tests); determines if coefficients differ significantly from zero.
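statsmodels reports per-coefficient t-statistics and p-values directly; a sketch assuming predefined X_train and y_train:

```python
import statsmodels.api as sm

ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())  # t-statistics and p-values for each coefficient
```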
Applications
Economics
Modeling demand, supply, price elasticity, forecasting economic indicators.
Healthcare
Predicting patient outcomes, dosage-response relationships, risk factor analysis.
Engineering
System modeling, quality control, failure prediction.
Marketing
Sales forecasting, customer lifetime value prediction, advertising effectiveness.
Environmental Science
Modeling pollution levels, climate data analysis, resource utilization.
Limitations and Challenges
Linearity Constraint
Cannot capture nonlinear relationships without transformations or feature engineering.
Outlier Sensitivity
Highly affected by extreme values, which can distort coefficient estimates and predictions.
Multicollinearity
Correlated predictors cause instability in coefficient estimation and interpretation.
Homoscedasticity Violation
Unequal error variances violate the homoscedasticity assumption, invalidating standard errors and hypothesis tests.
Overfitting
Model may fit training data noise if too many features or insufficient regularization.
Extensions and Variants
Polynomial Regression
Incorporates polynomial terms of the features to model nonlinear relationships while remaining linear in the parameters.
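In scikit-learn, polynomial terms are generated as new features and fed to an ordinary linear model; degree 2 here is illustrative, and X_train, y_train are assumed:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Still linear in the parameters: only the feature space is expanded
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
```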
Multiple Linear Regression
Uses multiple predictors simultaneously to improve predictive power.
Generalized Linear Models (GLM)
Extension to handle non-normal error distributions and link functions.
Robust Regression
Techniques (e.g., Huber regression) reduce impact of outliers and violations of assumptions.
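scikit-learn provides HuberRegressor as one robust option; epsilon controls where the loss switches from squared to linear (the value shown is the library default):

```python
from sklearn.linear_model import HuberRegressor

# Squared loss for small residuals, linear loss for large ones
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)  # X_train, y_train assumed to exist
```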
Bayesian Linear Regression
Incorporates prior distributions on parameters for probabilistic inference.
Implementation in Practice
Data Preparation
Feature scaling, handling missing values, encoding categorical variables, outlier treatment.
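A sketch of one way to bundle these steps into a scikit-learn pipeline (the column lists num_cols and cat_cols are hypothetical placeholders):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    # num_cols / cat_cols: hypothetical lists of column names
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
```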
Model Training
Selecting algorithm (OLS, gradient descent), splitting data into training and test sets.
Hyperparameter Tuning
Regularization strength λ, polynomial degree, and learning rate are optimized via cross-validation.
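A sketch tuning Ridge's regularization strength with cross-validated grid search (the alpha grid is illustrative; X_train and y_train are assumed):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
best_alpha = search.best_params_["alpha"]
```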
Software Tools
Popular libraries: scikit-learn (Python), statsmodels (Python), R lm(), MATLAB regress.
Example Code
```python
from sklearn.linear_model import LinearRegression

# Fit OLS on the training split, then predict on held-out data
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```