Definition and Purpose

Concept

Simple linear regression models the relationship between two continuous variables: one independent (predictor) and one dependent (response). It assumes a linear association.

Purpose

Quantify the strength and direction of a linear relationship, predict values of the dependent variable, and explain variance in the response using the predictor.

Terminology

Independent variable (X): predictor, explanatory variable. Dependent variable (Y): response, outcome variable. Regression line: best-fit line summarizing relationship.

"All models are wrong, but some are useful." -- George E. P. Box

Model Assumptions

Linearity

Relationship between X and Y is linear: E(Y|X) = β0 + β1X.

Independence

Observations are independent of each other; no autocorrelation.

Homoscedasticity

Constant variance of errors (residuals) across all levels of X.

Normality of Errors

Errors (residuals) are normally distributed with mean zero.

No multicollinearity

With only a single predictor, simple linear regression avoids multicollinearity by construction.

Mathematical Formulation

Regression Equation

Y = β0 + β1X + ε where β0 = intercept, β1 = slope, ε = error term.

Error Term

ε captures deviation of observed Y from predicted value; assumed E(ε) = 0, Var(ε) = σ².

Interpretation

β0: expected value of Y when X=0. β1: expected change in Y per unit increase in X.

Y = β₀ + β₁X + ε,  with E(ε) = 0 and Var(ε) = σ²
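As a concrete illustration, data can be simulated from this model; the parameter values below (β₀ = 1, β₁ = 2, σ = 0.5) are assumed for the sketch, not taken from the text:

```python
import numpy as np

# Simulate n observations from Y = b0 + b1*X + eps with assumed
# (hypothetical) parameters b0 = 1.0, b1 = 2.0 and error sd sigma = 0.5.
rng = np.random.default_rng(0)
n = 200
b0, b1, sigma = 1.0, 2.0, 0.5

X = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, sigma, size=n)  # E(eps) = 0, Var(eps) = sigma^2
Y = b0 + b1 * X + eps
```

The sample mean of eps should be close to zero, consistent with the assumption E(ε) = 0.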

Parameter Estimation

Least Squares Method

Estimates β0, β1 by minimizing sum of squared residuals: Σ(Yi − Ŷi)².

Formulas for Estimates

β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² ; β̂0 = Ȳ − β̂1X̄.

Properties

Under the Gauss-Markov assumptions, the least-squares estimates are unbiased and efficient; they are the best linear unbiased estimators (BLUE).

β̂₁ = ∑(Xᵢ − X̄)(Yᵢ − Ȳ) / ∑(Xᵢ − X̄)²
β̂₀ = Ȳ − β̂₁X̄

Computation

Implemented via matrix algebra or statistical software; simple linear regression admits a closed-form solution.
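A minimal sketch of the closed-form estimates in Python (NumPy assumed; the data values are illustrative and chosen to lie exactly on the line Y = 1 + 2X):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least-squares estimates for simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    # Slope: Sxy / Sxx, intercept: ybar - b1 * xbar
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Data exactly on Y = 1 + 2X, so the fit recovers b0 = 1, b1 = 2
b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
```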

Interpretation of Coefficients

Intercept (β0)

Expected value of Y when X=0; meaningful only if X=0 lies within data range.

Slope (β1)

Rate of change of Y per unit increase in X; sign indicates direction of relationship.

Practical Meaning

Quantifies how predictor influences response; critical for prediction and inference.

Example

If β1 = 2, then Y is expected to increase by 2 units for each 1-unit increase in X.

Goodness of Fit

Coefficient of Determination (R²)

Proportion of variance in Y explained by X: R² = SSR / SST.

Explanation

Ranges from 0 to 1; higher values indicate better model fit.

Residual Standard Error (RSE)

Standard deviation of residuals; measures prediction error magnitude.
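Both fit measures follow directly from the residuals; a sketch under the same closed-form fit as above (the data array is illustrative, nearly on the line Y = 2X):

```python
import numpy as np

def fit_metrics(x, y):
    """Fit simple OLS and return (R^2, RSE) for the fitted line."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x
    sse = np.sum((y - yhat) ** 2)      # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - sse / sst               # equals SSR / SST
    rse = np.sqrt(sse / (len(x) - 2))  # residual standard error, df = n - 2
    return r2, rse

r2, rse = fit_metrics([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

For nearly collinear data like this, R² is close to 1 and the RSE is small relative to the scale of Y.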

ANOVA Table

Partition of total variance into regression and residual components for significance testing.

Source       Sum of Squares   Degrees of Freedom   Mean Square           F Statistic
Regression   SSR              1                    MSR = SSR / 1         F = MSR / MSE
Residual     SSE              n − 2                MSE = SSE / (n − 2)
Total        SST              n − 1

Hypothesis Testing

Testing Slope (β1)

Null hypothesis H0: β1 = 0 (no linear relationship). Alternative H1: β1 ≠ 0.

T-Statistic

t = β̂1 / SE(β̂1), compared to t-distribution with n-2 degrees of freedom.

Confidence Intervals

95% CI for β1: β̂1 ± t* × SE(β̂1), where t* is critical t-value.

Interpretation

Rejecting H0 implies a statistically significant linear relationship between X and Y.

H₀: β₁ = 0
t = β̂₁ / SE(β̂₁) ~ t(n − 2)
Reject H₀ if |t| > t(α/2, n − 2)
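The t-test and confidence interval for the slope can be sketched as follows (SciPy assumed for the t-distribution; the data array is illustrative):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, alpha=0.05):
    """t-statistic, p-value, and CI for the slope in simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    mse = np.sum(resid ** 2) / (n - 2)      # estimate of sigma^2
    se_b1 = np.sqrt(mse / sxx)              # standard error of the slope
    t_stat = b1 / se_b1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided test
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)     # critical t-value
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return t_stat, p_value, ci

t_stat, p, (lo, hi) = slope_inference([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

Here the data are close to Y = 2X, so the test rejects H₀ decisively and the interval covers the slope near 2.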

Residual Analysis

Purpose

Evaluate model assumptions, detect outliers, non-linearity, heteroscedasticity.

Residual Plots

Plot residuals vs fitted values: random scatter indicates good fit; patterns suggest violations.

Normal Q-Q Plot

Assess normality of residuals; points along line indicate normal distribution.

Influence Diagnostics

Leverage and Cook’s distance identify influential observations affecting model.
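For simple linear regression, leverage and Cook's distance have simple closed forms; a sketch (the data are illustrative, with the last point both far in X and off the trend in Y):

```python
import numpy as np

def influence(x, y):
    """Leverage h_i and Cook's distance D_i for simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    mse = np.sum(resid ** 2) / (n - 2)
    h = 1.0 / n + (x - x.mean()) ** 2 / sxx            # leverage
    p = 2                                              # parameters: b0 and b1
    cooks = resid ** 2 / (p * mse) * h / (1 - h) ** 2  # Cook's distance
    return h, cooks

h, d = influence([1, 2, 3, 4, 5, 10], [2.0, 4.1, 5.9, 8.2, 9.8, 30.0])
```

The leverages sum to the number of parameters (here 2), and the final point, extreme in both X and Y, dominates both diagnostics.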

Applications

Economics

Model relationship between income and expenditure, price and demand.

Biology

Relate enzyme activity to substrate concentration, growth rate to nutrient levels.

Engineering

Predict stress-strain relationships, system responses.

Social Sciences

Analyze effect of education on earnings, age on cognitive scores.

Environmental Science

Model pollutant concentration changes over time or location.

Limitations

Linearity Restriction

Only models linear relationships; nonlinear patterns require other methods.

Sensitivity to Outliers

Outliers can disproportionately affect estimates and fit.

Single Predictor

Ignores confounding or interacting variables; multivariate regression needed.

Assumption Violations

Violations of homoscedasticity, normality, independence reduce validity.

Extrapolation Risk

Predictions outside observed data range may be unreliable.

Extensions and Related Models

Multiple Linear Regression

Incorporates multiple predictors; models more complex relationships.

Polynomial Regression

Models nonlinear relationships by including powers of X.

Robust Regression

Reduces influence of outliers using alternative loss functions.

Generalized Linear Models

Extend linear regression to non-normal responses (e.g., logistic regression).

Nonparametric Regression

Flexible models without fixed functional form (e.g., splines, kernel methods).
