Definition and Overview

Concept

Linear regression models a linear relationship between a dependent variable (target) and one or more independent variables (features). Objective: predict continuous outcomes by fitting a linear equation to observed data.

Type

Supervised learning: requires labeled datasets with input-output pairs. Regression task: output is continuous numerical value.

Historical Context

Originated in statistics: the method of least squares was published by Legendre (1805) and Gauss (1809). Widely adopted in machine learning for predictive analytics and data modeling.

"All models are wrong, but some are useful." -- George Box

Mathematical Formulation

Model Equation

Linear regression model: y = Xβ + ε, where y is the vector of targets, X the matrix of features (typically including a column of ones for the intercept), β the vector of coefficients, and ε the error term.

Ordinary Least Squares (OLS)

Estimation method minimizing sum of squared residuals: ∑(y_i - x_iβ)². Objective: find β̂ minimizing residual sum of squares (RSS).

Closed-form Solution

β̂ = (XᵀX)⁻¹Xᵀy

Where Xᵀ is the transpose of X and (XᵀX)⁻¹ its inverse. The solution exists only when XᵀX is invertible, i.e., when X has full column rank.
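A minimal sketch of the closed-form solution with synthetic data (the data, coefficients, and variable names here are illustrative, not from the source). Solving the normal equations XᵀXβ = Xᵀy directly is preferred over forming the explicit inverse:

```python
import numpy as np

# Synthetic data: 100 samples, intercept column plus 2 features,
# generated from known coefficients with small Gaussian noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.01 * rng.normal(size=100)

# Closed-form OLS: solve (XᵀX) β̂ = Xᵀy instead of computing (XᵀX)⁻¹ explicitly,
# which is numerically better conditioned.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice `np.linalg.lstsq` is even more robust, since it handles rank-deficient X via the SVD.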

Core Assumptions

Linearity

Relationship between features and target is linear in parameters.

Independence

Observations are independent of each other; no autocorrelation.

Homoscedasticity

Constant variance of error terms across all levels of independent variables.

Normality of Errors

Error terms are normally distributed with zero mean. Required for exact hypothesis tests and confidence intervals, not for unbiased coefficient estimation.

No Multicollinearity

Independent variables are not highly correlated among themselves.

Estimation Methods

Ordinary Least Squares (OLS)

Most common method: minimizes the residual sum of squares. Under the Gauss-Markov assumptions, OLS yields the best linear unbiased estimator (BLUE).

Gradient Descent

Iterative optimization: updates β parameters by moving opposite gradient of loss function. Suitable for large-scale data.
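The update rule can be sketched as follows on synthetic noiseless data (learning rate, iteration count, and data are illustrative assumptions). The gradient of the mean squared error (1/n)‖Xβ − y‖² is (2/n)Xᵀ(Xβ − y):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
beta_true = np.array([3.0, -1.5])
y = X @ beta_true  # noiseless target so convergence is exact

beta = np.zeros(2)   # initialize parameters at zero
lr = 0.1             # learning rate (step size)
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ beta - y)  # gradient of the MSE loss
    beta -= lr * grad                           # step opposite the gradient
```

The learning rate must be small enough relative to the largest eigenvalue of (2/n)XᵀX, otherwise the iteration diverges.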

Stochastic Gradient Descent (SGD)

Variant of gradient descent that updates parameters using a single randomly chosen sample (or a small mini-batch) per iteration. Much cheaper per step than full-batch gradient descent, so it scales to large datasets, at the cost of noisier convergence.
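A sketch using scikit-learn's SGDRegressor, which implements this per-sample update for linear models (the synthetic data and hyperparameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5  # linear target with intercept

# SGDRegressor minimizes squared error one sample at a time;
# alpha is the regularization strength (kept tiny here).
model = SGDRegressor(max_iter=1000, tol=1e-6, alpha=1e-6, random_state=0)
model.fit(X, y)
```

For SGD, feature scaling matters: poorly scaled features make a single learning rate ineffective across dimensions.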

Maximum Likelihood Estimation

Interprets regression as probabilistic model; maximizes likelihood of observed data given parameters.
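Under the normality assumption ε ~ N(0, σ²I), the log-likelihood of the data is

ℓ(β, σ²) = -(n/2) ln(2πσ²) - (1/(2σ²)) ∑(y_i - x_iβ)²

Maximizing ℓ over β is equivalent to minimizing ∑(y_i - x_iβ)², so the MLE of β coincides with the OLS estimate.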

Model Evaluation Metrics

Mean Squared Error (MSE)

Average squared difference between predicted and actual values: sensitive to outliers.

Root Mean Squared Error (RMSE)

Square root of MSE; interpretable in same units as dependent variable.

Mean Absolute Error (MAE)

Average absolute differences; robust to outliers compared to MSE.

R-squared (Coefficient of Determination)

Proportion of variance in the dependent variable explained by the model. Typically ranges from 0 (no explanatory power) to 1 (perfect fit); it can be negative when the model predicts worse than the mean of the target.

Adjusted R-squared

Adjusts R² for number of predictors, penalizes overfitting.

Metric      Formula                 Interpretation
MSE         (1/n) ∑(y_i - ŷ_i)²     Lower is better; penalizes large errors
R-squared   1 - (RSS/TSS)           Higher is better; variance explained
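These metrics can be computed directly with scikit-learn's metrics module; the small arrays of predictions below are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # robust to outliers
r2 = r2_score(y_true, y_pred)              # 1 - RSS/TSS
```

Note that MSE and MAE rank models differently when large errors are present, because MSE squares them.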

Feature Selection and Importance

Purpose

Identify relevant predictors contributing most to target variable; reduce dimensionality and improve generalization.

Techniques

Forward selection, backward elimination, stepwise regression. Use statistical tests and information criteria (AIC, BIC).

Coefficient Magnitude

Size and sign of β coefficients indicate feature importance and direction of influence.

Multicollinearity Impact

High correlation can distort importance; requires diagnostics like Variance Inflation Factor (VIF).
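VIF_j is 1/(1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A minimal sketch computing it from scratch on synthetic data (variable names and the correlation structure are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)             # independent feature
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R²_j), with R²_j from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)
```

A common rule of thumb flags VIF values above 5 or 10 as problematic; here x1 and x2 produce large VIFs while x3 stays near 1.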

Regularization Techniques

Purpose

Prevent overfitting, handle multicollinearity, improve model robustness by penalizing large coefficients.

Ridge Regression (L2)

Adds penalty proportional to the square of the coefficients: λ∑β_j². Shrinks coefficients toward zero but never exactly to zero.

Lasso Regression (L1)

Penalty proportional to absolute value of coefficients: λ∑|β_j|. Enables sparse models by forcing some coefficients to zero.

Elastic Net

Combines L1 and L2 penalties; balances sparsity and coefficient shrinkage.

Loss = RSS + λ₁ ∑|β_j| + λ₂ ∑β_j²
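The three penalties can be compared with scikit-learn's Ridge, Lasso, and ElasticNet (where `alpha` plays the role of λ; the sparse synthetic data below is an illustrative assumption). With a sparse true signal, Lasso zeroes out irrelevant coefficients while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
beta = np.zeros(10)
beta[:3] = [5.0, -4.0, 3.0]               # only 3 of 10 features matter
y = X @ beta + 0.1 * rng.normal(size=300)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrink, never zero
lasso = Lasso(alpha=0.5).fit(X, y)                    # L1: sparse solution
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mixed L1/L2
```

Note that scikit-learn scales the RSS term by 1/(2n) in Lasso and ElasticNet, so `alpha` values are not directly comparable across the three estimators.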

Interpretation of Coefficients

Magnitude

Indicates strength of the feature's effect on the target. A larger absolute value means stronger influence; magnitudes are comparable across features only when the features are on the same scale (e.g., standardized).

Sign

Positive coefficient: direct relationship; negative: inverse relationship.

Units

Coefficient units: change in target per one unit change in feature, holding others constant.

Statistical Significance

Assessed via hypothesis tests (t-tests); determines if coefficients differ significantly from zero.
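The t-statistic for each coefficient is β̂_j divided by its standard error, sqrt(σ̂²[(XᵀX)⁻¹]_jj). A sketch computing this from scratch on synthetic data where one coefficient is truly zero (the data and threshold of |t| ≈ 2 for significance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)  # third coefficient is zero

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])     # unbiased estimate of error variance
cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance matrix of beta_hat
t_stats = beta_hat / np.sqrt(np.diag(cov))    # per-coefficient t-statistics
```

The coefficient with true value 2.0 yields a large |t|, while the truly-zero coefficient stays near zero, so a t-test would not reject β_2 = 0.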

Applications

Economics

Modeling demand, supply, price elasticity, forecasting economic indicators.

Healthcare

Predicting patient outcomes, dosage-response relationships, risk factor analysis.

Engineering

System modeling, quality control, failure prediction.

Marketing

Sales forecasting, customer lifetime value prediction, advertising effectiveness.

Environmental Science

Modeling pollution levels, climate data analysis, resource utilization.

Limitations and Challenges

Linearity Constraint

Cannot capture nonlinear relationships without transformations or feature engineering.

Outliers Sensitivity

Highly affected by extreme values, may distort coefficient estimates and predictions.

Multicollinearity

Correlated predictors cause instability in coefficient estimation and interpretation.

Homoscedasticity Violation

Unequal error variances violate assumptions, affecting inference validity.

Overfitting

Model may fit training data noise if too many features or insufficient regularization.

Extensions and Variants

Polynomial Regression

Incorporates polynomial terms to model nonlinear relationships within linear framework.
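The "linear in parameters" framework still applies after expanding features to polynomial terms; a sketch with scikit-learn's PolynomialFeatures on a synthetic quadratic target (data and degree are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2  # quadratic target

# Expand x into [1, x, x²] and fit ordinary linear regression on those features.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
```

The model remains linear in β even though it is nonlinear in x; the degree becomes a hyperparameter that controls the bias-variance trade-off.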

Multiple Linear Regression

Uses multiple predictors simultaneously to improve predictive power.

Generalized Linear Models (GLM)

Extension to handle non-normal error distributions and link functions.

Robust Regression

Techniques (e.g., Huber regression) reduce impact of outliers and violations of assumptions.

Bayesian Linear Regression

Incorporates prior distributions on parameters for probabilistic inference.

Implementation in Practice

Data Preparation

Feature scaling, handling missing values, encoding categorical variables, outlier treatment.

Model Training

Selecting algorithm (OLS, gradient descent), splitting data into training and test sets.

Hyperparameter Tuning

Regularization strength λ, polynomial degree, learning rates optimized via cross-validation.
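A sketch of tuning the Ridge regularization strength via cross-validation with scikit-learn's GridSearchCV (the candidate grid and synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + 0.2 * rng.normal(size=200)

# 5-fold cross-validation over candidate regularization strengths.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

RidgeCV and LassoCV offer the same idea with more efficient, estimator-specific implementations.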

Software Tools

Popular libraries: scikit-learn (Python), statsmodels (Python), R lm(), MATLAB regress.

Example Code

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)           # X_train, y_train from a prior train/test split
predictions = model.predict(X_test)
