Definition and Purpose
Concept
Simple linear regression models the relationship between two continuous variables: one independent (predictor) and one dependent (response). It assumes a linear association.
Purpose
Quantify the strength and direction of a linear relationship, predict values of the dependent variable, and explain variance in the response using the predictor.
Terminology
Independent variable (X): predictor, explanatory variable. Dependent variable (Y): response, outcome variable. Regression line: best-fit line summarizing relationship.
"All models are wrong, but some are useful." -- George E. P. Box
Model Assumptions
Linearity
Relationship between X and Y is linear: E(Y|X) = β0 + β1X.
Independence
Observations are independent of each other; no autocorrelation.
Homoscedasticity
Constant variance of errors (residuals) across all levels of X.
Normality of Errors
Errors (residuals) are normally distributed with mean zero.
No multicollinearity
With only a single predictor, simple linear regression cannot suffer from multicollinearity; this assumption becomes relevant in multiple regression.
Mathematical Formulation
Regression Equation
Y = β0 + β1X + ε where β0 = intercept, β1 = slope, ε = error term.
Error Term
ε captures deviation of observed Y from predicted value; assumed E(ε) = 0, Var(ε) = σ².
Interpretation
β0: expected value of Y when X=0. β1: expected change in Y per unit increase in X.
Y = β₀ + β₁X + ε,  E(ε) = 0,  Var(ε) = σ²
Parameter Estimation
Least Squares Method
Estimates β0, β1 by minimizing sum of squared residuals: Σ(Yi − Ŷi)².
Formulas for Estimates
β1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² ; β0 = Ȳ − β1X̄.
Properties
Estimates are unbiased, efficient under Gauss-Markov assumptions; best linear unbiased estimators (BLUE).
β̂₁ = ∑(Xᵢ − X̄)(Yᵢ − Ȳ) / ∑(Xᵢ − X̄)²,  β̂₀ = Ȳ − β̂₁X̄
Computation
Computed via matrix algebra or statistical software; simple linear regression admits the closed-form solution above.
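As a minimal sketch, the closed-form least-squares formulas can be applied directly with NumPy; the data here (hours studied vs. exam score) is hypothetical:

```python
import numpy as np

# Hypothetical data: hours studied (X) vs. exam score (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_bar, y_bar = X.mean(), Y.mean()

# Closed-form least-squares estimates
beta1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)  # β̂₀ ≈ 0.27, β̂₁ ≈ 1.93
```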
Interpretation of Coefficients
Intercept (β0)
Expected value of Y when X=0; meaningful only if X=0 lies within data range.
Slope (β1)
Rate of change of Y per unit increase in X; sign indicates direction of relationship.
Practical Meaning
Quantifies how predictor influences response; critical for prediction and inference.
Example
If β1 = 2, then Y is expected to increase by 2 units for each 1-unit increase in X.
Goodness of Fit
Coefficient of Determination (R²)
Proportion of variance in Y explained by X: R² = SSR / SST.
Explanation
Ranges from 0 to 1; higher values indicate better model fit.
Residual Standard Error (RSE)
Standard deviation of residuals; measures prediction error magnitude.
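A short sketch of computing R² and the residual standard error from the definitions above, on hypothetical data:

```python
import numpy as np

# Hypothetical data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(X)

# Least-squares fit
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()
Y_hat = beta0 + beta1 * X

SST = np.sum((Y - Y.mean()) ** 2)   # total sum of squares
SSE = np.sum((Y - Y_hat) ** 2)      # residual sum of squares
R2 = 1 - SSE / SST                  # coefficient of determination
RSE = np.sqrt(SSE / (n - 2))        # residual standard error

print(R2, RSE)
```

Note R² = SSR/SST = 1 − SSE/SST; the two forms are equivalent because SST = SSR + SSE.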
ANOVA Table
Partition of total variance into regression and residual components for significance testing.
| Source | Sum of Squares | Degrees of Freedom | Mean Square | F Statistic |
|---|---|---|---|---|
| Regression | SSR | 1 | MSR = SSR / 1 | F = MSR / MSE |
| Residual | SSE | n - 2 | MSE = SSE / (n-2) | |
| Total | SST | n - 1 | | |
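The variance partition in the ANOVA table can be computed directly; this sketch (hypothetical data) verifies SST = SSR + SSE and forms the F statistic:

```python
import numpy as np

# Hypothetical data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(X)

beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()
Y_hat = beta0 + beta1 * X

SST = np.sum((Y - Y.mean()) ** 2)   # total sum of squares (df = n - 1)
SSE = np.sum((Y - Y_hat) ** 2)      # residual sum of squares (df = n - 2)
SSR = SST - SSE                     # regression sum of squares (df = 1)

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE                       # compared to F(1, n - 2)

print(F)
```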
Hypothesis Testing
Testing Slope (β1)
Null hypothesis H0: β1 = 0 (no linear relationship). Alternative H1: β1 ≠ 0.
T-Statistic
t = β̂1 / SE(β̂1), compared to t-distribution with n-2 degrees of freedom.
Confidence Intervals
95% CI for β1: β̂1 ± t* × SE(β̂1), where t* is critical t-value.
Interpretation
Rejecting H0 indicates a statistically significant linear relationship between X and Y.
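The t-test and confidence interval above can be sketched with NumPy and SciPy (scipy.stats.t supplies the t-distribution); the data is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(X)

Sxx = np.sum((X - X.mean()) ** 2)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
beta0 = Y.mean() - beta1 * X.mean()
resid = Y - (beta0 + beta1 * X)

# Standard error of the slope: SE(β̂₁) = RSE / √Sxx
RSE = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_beta1 = RSE / np.sqrt(Sxx)

# Two-sided t-test of H0: β₁ = 0 with n - 2 degrees of freedom
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)

# 95% confidence interval for β₁
t_crit = stats.t.ppf(0.975, n - 2)
ci = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)

print(t_stat, p_value, ci)
```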
H₀: β₁ = 0;  t = β̂₁ / SE(β̂₁) ~ t(n−2);  reject H₀ at level α if |t| exceeds the critical value t(α/2, n−2)
Residual Analysis
Purpose
Evaluate model assumptions, detect outliers, non-linearity, heteroscedasticity.
Residual Plots
Plot residuals vs fitted values: random scatter indicates good fit; patterns suggest violations.
Normal Q-Q Plot
Assess normality of residuals; points along line indicate normal distribution.
Influence Diagnostics
Leverage and Cook’s distance identify influential observations affecting model.
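For simple linear regression, leverage and Cook's distance have closed forms (hᵢ = 1/n + (xᵢ − x̄)²/Sxx; Dᵢ = eᵢ²/(p·MSE) · hᵢ/(1 − hᵢ)²). A sketch on hypothetical data:

```python
import numpy as np

# Hypothetical data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n, p = len(X), 2  # p = number of estimated parameters (β0, β1)

Sxx = np.sum((X - X.mean()) ** 2)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
beta0 = Y.mean() - beta1 * X.mean()
e = Y - (beta0 + beta1 * X)          # residuals
MSE = np.sum(e ** 2) / (n - p)

# Leverage: h_i = 1/n + (x_i - x̄)² / Sxx; large h_i means an unusual X value
h = 1 / n + (X - X.mean()) ** 2 / Sxx

# Cook's distance: combines residual size and leverage;
# values near or above 1 flag influential observations
D = e ** 2 / (p * MSE) * h / (1 - h) ** 2

print(h, D)
```

Leverages always sum to p, so points with hᵢ much larger than p/n deserve a closer look.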
Applications
Economics
Model relationship between income and expenditure, price and demand.
Biology
Relate enzyme activity to substrate concentration, growth rate to nutrient levels.
Engineering
Predict stress-strain relationships, system responses.
Social Sciences
Analyze effect of education on earnings, age on cognitive scores.
Environmental Science
Model pollutant concentration changes over time or location.
Limitations
Linearity Restriction
Only models linear relationships; nonlinear patterns require other methods.
Sensitivity to Outliers
Outliers can disproportionately affect estimates and fit.
Single Predictor
Ignores confounding or interacting variables; multiple regression is needed to account for them.
Assumption Violations
Violations of homoscedasticity, normality, independence reduce validity.
Extrapolation Risk
Predictions outside observed data range may be unreliable.
Extensions and Related Models
Multiple Linear Regression
Incorporates multiple predictors; models more complex relationships.
Polynomial Regression
Models nonlinear relationships by including powers of X.
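As an illustration of fitting powers of X by least squares, numpy.polyfit can be used; the data here is constructed to lie exactly on a quadratic, so the fit recovers the true coefficients:

```python
import numpy as np

# Hypothetical data on an exact quadratic: Y = (X - 1)² = X² - 2X + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = X ** 2 - 2 * X + 1

# Degree-2 least-squares fit: Y ≈ c2·X² + c1·X + c0
# np.polyfit returns coefficients from highest degree down
c2, c1, c0 = np.polyfit(X, Y, deg=2)

print(c2, c1, c0)  # ≈ 1, -2, 1
```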
Robust Regression
Reduces influence of outliers using alternative loss functions.
Generalized Linear Models
Extend linear regression to non-normal responses (e.g., logistic regression).
Nonparametric Regression
Flexible models without fixed functional form (e.g., splines, kernel methods).
References
- Montgomery, D.C., Peck, E.A., Vining, G.G. Introduction to Linear Regression Analysis. Wiley, 5th ed., 2012, pp. 1-50.
- Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning. Springer, 2nd ed., 2009, pp. 45-75.
- Kutner, M.H., Nachtsheim, C.J., Neter, J., Li, W. Applied Linear Statistical Models. McGraw-Hill, 5th ed., 2005, pp. 100-150.
- Seber, G.A.F., Lee, A.J. Linear Regression Analysis. Wiley, 2nd ed., 2003, pp. 20-60.
- Freedman, D. Statistical Models: Theory and Practice. Cambridge University Press, 2009, pp. 15-40.