Introduction

Assumptions in regression analysis define the conditions under which parameter estimates are unbiased, consistent, and efficient. Violations distort inference, prediction, and hypothesis testing. Core assumptions concern functional form, error terms, and data properties. Understanding and testing these assumptions is critical for model validity.

"All models are wrong, but some are useful." -- George E.P. Box

Linearity

Definition

Relationship between predictors and response is linear in parameters. Model: Y = β0 + β1X1 + ... + βkXk + ε.

Implications

Nonlinear relationships cause biased estimates and poor predictions; transformations or nonlinear models are then required.

Detection

Scatterplots, residual plots, component-plus-residual plots.

Correction

Polynomial terms, interaction terms, nonlinear regression techniques.
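A minimal numpy sketch on simulated data (all names and the quadratic example are illustrative, not from the text): when the true relation is curved, residuals from a straight-line fit show a systematic pattern, and adding a polynomial term removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, x.size)  # true relation is quadratic

def ols_resid(X, y):
    """Least-squares fit; returns residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Linear fit: a plot of these residuals against x would show a U-shape.
res_lin = ols_resid(np.column_stack([np.ones_like(x), x]), y)

# Adding the x**2 term removes the pattern and shrinks the residual sum of squares.
res_quad = ols_resid(np.column_stack([np.ones_like(x), x, x**2]), y)

rss_lin, rss_quad = res_lin @ res_lin, res_quad @ res_quad
```

The large drop in RSS after adding x**2 is the numeric counterpart of the pattern a residual plot would reveal.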

Independence

Definition

Observations and error terms are mutually independent. No correlation across residuals.

Implications

Dependence inflates Type I error rates and invalidates standard errors.

Detection

Durbin-Watson test, plots of residuals over time or observation order.

Correction

Time-series models (ARIMA), generalized least squares (GLS), clustered standard errors.
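The Durbin-Watson statistic can be computed directly from its definition; a sketch on simulated errors (the AR(1) example and variable names are illustrative). Values near 2 indicate no first-order autocorrelation; values near 0 suggest positive, near 4 negative autocorrelation.

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
e_indep = rng.normal(size=500)          # independent errors -> DW near 2
e_ar = np.empty(500)                    # AR(1) errors with rho = 0.8 -> DW well below 2
e_ar[0] = rng.normal()
for t in range(1, 500):
    e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal()

dw_indep = durbin_watson(e_indep)
dw_ar = durbin_watson(e_ar)
```

As a rule of thumb, DW is approximately 2(1 - rho), where rho is the first-order autocorrelation of the residuals.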

Homoscedasticity

Definition

Constant variance of error terms across all levels of independent variables.

Implications

Heteroscedasticity reduces efficiency, biases standard errors, invalidates tests.

Detection

Breusch-Pagan test, White test, residual vs. fitted value plots.

Correction

Weighted least squares, robust standard errors, transforming dependent variable.
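A sketch of the Breusch-Pagan LM test built from its definition (regress squared residuals on the predictors; LM = n * R-squared, chi-square under the null of constant variance). The simulated data and helper names are illustrative.

```python
import numpy as np
from scipy import stats

def breusch_pagan(resid, X):
    """LM version: regress squared residuals on X; returns (LM, p-value)."""
    n = len(resid)
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    lm = n * r2
    return lm, stats.chi2.sf(lm, X.shape[1] - 1)

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
y_hom = 2 + 3 * x + rng.normal(0, 1, n)          # constant error variance
y_het = 2 + 3 * x + rng.normal(0, 1, n) * x      # error variance grows with x

def fit_and_test(y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return breusch_pagan(y - X @ beta, X)[1]

p_hom, p_het = fit_and_test(y_hom), fit_and_test(y_het)
```

A small p-value for the heteroscedastic series rejects constant variance; the homoscedastic series typically does not.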

Normality

Definition

Error terms are normally distributed with mean zero.

Implications

Ensures validity of t-tests, F-tests; non-normality affects inference but not coefficient estimates.

Detection

Q-Q plots, Shapiro-Wilk test, Kolmogorov-Smirnov test.

Correction

Transformations, bootstrapping, nonparametric methods.
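A quick Shapiro-Wilk check on simulated residuals (the skewed alternative is an illustrative choice, not from the text); the null hypothesis is normality, so small p-values flag non-normal errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid_normal = rng.normal(0, 1, 300)            # normal errors
resid_skewed = rng.exponential(1, 300) - 1      # mean zero, but right-skewed

# Shapiro-Wilk: small p-value rejects normality.
_, p_normal = stats.shapiro(resid_normal)
_, p_skewed = stats.shapiro(resid_skewed)
```

In practice this is paired with a Q-Q plot, which shows where (tails vs. center) the departure from normality occurs.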

Multicollinearity

Definition

High correlation among independent variables causing redundancy.

Implications

Inflated variances of estimates, unstable coefficients, reduced interpretability.

Detection

Variance inflation factor (VIF), condition index, correlation matrix.

Correction

Remove or combine variables, principal component analysis, ridge regression.
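The VIF can be computed from first principles: regress each predictor on the others and take 1 / (1 - R-squared). A sketch on simulated data (the near-collinear pair is an illustrative construction).

```python
import numpy as np

def vif(X, j):
    """VIF for column j of X: regress X[:, j] on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)            # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, n)    # nearly collinear with x1 -> large VIF

X = np.column_stack([x1, x2, x3])
vifs = [vif(X, j) for j in range(3)]
```

Common rules of thumb treat VIF above 5 or 10 as a sign of problematic multicollinearity.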

Model Specification

Definition

Correct inclusion of relevant predictors and functional form.

Implications

Mis-specification leads to biased, inconsistent estimators.

Detection

Ramsey RESET test, residual plots, omitted variable tests.

Correction

Add omitted variables, transform variables, check interaction terms.
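A from-scratch sketch of the Ramsey RESET idea on simulated data (the specific powers of the fitted values and all names are illustrative): augment the regression with powers of the fitted values and F-test their joint significance; a significant result suggests mis-specification.

```python
import numpy as np
from scipy import stats

def reset_test(y, X):
    """RESET: add yhat^2 and yhat^3 as regressors and F-test them jointly."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    Xa = np.column_stack([X, yhat ** 2, yhat ** 3])
    beta_a, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    rss_r = np.sum((y - yhat) ** 2)           # restricted (original) model
    rss_u = np.sum((y - Xa @ beta_a) ** 2)    # augmented model
    q = 2                                     # number of added terms
    df = n - Xa.shape[1]
    f = ((rss_r - rss_u) / q) / (rss_u / df)
    return f, stats.f.sf(f, q, df)

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
y_ok = 1 + 2 * x + rng.normal(0, 1, n)                  # correctly specified
y_bad = 1 + 2 * x + 1.5 * x**2 + rng.normal(0, 1, n)    # omitted x^2 term

_, p_ok = reset_test(y_ok, X)
_, p_bad = reset_test(y_bad, X)
```

The model with the omitted quadratic term is flagged; the correctly specified model typically is not.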

Residual Analysis

Definition

Examination of differences between observed and predicted values (residuals).

Usage

Detect assumption violations, outliers, leverage points, influential observations.

Tools

Residual plots, standardized residuals, Cook’s distance, leverage statistics.

Interpretation

Patterns indicate model inadequacy or data issues.

Heteroscedasticity

Definition

Non-constant variance of error terms related to predictors or fitted values.

Consequences

Invalid standard errors, biased test statistics, inefficient OLS estimates.

Detection

Graphical residual analysis, Breusch-Pagan, White tests.

Remedies

Robust standard errors, weighted least squares, variance-stabilizing transformations.
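A weighted least squares sketch under the illustrative assumption that the error standard deviation is proportional to x (so the weights 1/x are known up to scale): WLS is equivalent to OLS on the rescaled system.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n) * x   # error sd proportional to x

X = np.column_stack([np.ones(n), x])

# OLS: still unbiased under heteroscedasticity, but inefficient.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS with weights 1/x (inverse error sd): divide each row of X and y by x,
# then run ordinary least squares on the transformed system.
w = 1.0 / x
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
```

Both estimators recover the true coefficients on average; WLS does so with smaller sampling variance when the variance function is correctly specified.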

Autocorrelation

Definition

Correlation of error terms across observations, often in time series data.

Effects

Biased standard errors, invalid hypothesis tests, inefficient estimates.

Detection

Durbin-Watson test, Ljung-Box test, autocorrelation function (ACF) plots.

Correction

Generalized least squares, autoregressive models, Newey-West standard errors.
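The Ljung-Box Q statistic can be assembled directly from the residual autocorrelations; a sketch on simulated series (the AR(1) alternative is illustrative). Under the null of no autocorrelation up to the chosen lag, Q is chi-square with that many degrees of freedom.

```python
import numpy as np
from scipy import stats

def ljung_box(resid, lags=10):
    """Q = n(n+2) * sum_k rho_k^2 / (n - k); returns (Q, p-value)."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = e @ e
    q = 0.0
    for k in range(1, lags + 1):
        rho_k = (e[k:] @ e[:-k]) / denom   # lag-k sample autocorrelation
        q += rho_k ** 2 / (n - k)
    q *= n * (n + 2)
    return q, stats.chi2.sf(q, lags)

rng = np.random.default_rng(7)
white = rng.normal(size=400)               # no autocorrelation
ar = np.empty(400)                         # AR(1) with rho = 0.6
ar[0] = rng.normal()
for t in range(1, 400):
    ar[t] = 0.6 * ar[t - 1] + rng.normal()

_, p_white = ljung_box(white)
_, p_ar = ljung_box(ar)
```

A small p-value for the AR(1) series rejects the no-autocorrelation null; white noise typically does not.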

OLS Assumptions Summary

Classical Linear Regression Model Assumptions

1. Linearity: Y = Xβ + ε, linear in parameters
2. Independence: Cov(ε_i, ε_j) = 0 for i ≠ j
3. Homoscedasticity: Var(ε_i) = σ² constant
4. Normality: ε_i ~ N(0, σ²)
5. No perfect multicollinearity among X variables
6. Exogeneity: E[ε_i|X] = 0

Consequences of Violation

Unbiasedness, consistency, efficiency, and valid inference may be compromised, depending on which assumption fails.

Practical Notes

Large samples reduce normality concerns (CLT). Some violations correctable with robust methods.

Assumption          Effect if Violated      Detection Method
Linearity           Biased estimates        Residual plots
Independence        Invalid SEs, tests      Durbin-Watson
Homoscedasticity    Inefficient estimates   Breusch-Pagan
Normality           Invalid inference       Q-Q plot, Shapiro-Wilk
Multicollinearity   Unstable estimates      VIF, correlations

Regression Diagnostics

Purpose

Identify assumption violations, outliers, leverage and influence points.

Common Techniques

Residual plots, leverage statistics, Cook’s distance, DFBETAS, Hat matrix diagonals.

Interpretation

Large residuals: outliers. High leverage: unusual predictor values with the potential to influence the fit. High Cook’s distance: influential observations.

Example Diagnostics Table

Diagnostic              Description                      Threshold
Standardized Residual   Residual scaled by SD            |value| > 3 indicates outlier
Leverage (Hat value)    Influence potential of X value   > 2p/n considered high
Cook’s Distance         Combined influence measure       > 1 suggests influential point
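The diagnostics in the table can be computed from the hat matrix; a sketch on simulated data with one deliberately planted influential point (the data and thresholds shown follow the rules of thumb above; all variable names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.normal(0, 1, n)
y = 1 + 2 * x + rng.normal(0, 0.5, n)
x[0], y[0] = 6.0, -10.0                 # plant a high-leverage, influential point

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
h = np.diag(H)                          # leverage values
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - p)            # residual variance estimate

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2

high_leverage = h > 2 * p / n           # leverage rule of thumb
influential = cooks > 1                 # Cook's distance rule of thumb
```

Only the planted point exceeds both thresholds; removing or down-weighting it and refitting would show how strongly it distorts the slope.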

Remediation

Investigate data errors, consider transformations, robust regression, or model refinement.
