Introduction
Assumptions in regression analysis define the conditions under which parameter estimates are unbiased, consistent, and efficient. Violations distort inference, prediction, and hypothesis testing. Core assumptions center on functional form, error terms, and data properties. Understanding and testing these assumptions is critical for model validity.
"All models are wrong, but some are useful." -- George E.P. Box
Linearity
Definition
Relationship between predictors and response is linear in parameters. Model: Y = β0 + β1X1 + ... + βkXk + ε.
Implications
Nonlinear relationships cause biased estimates, poor predictions. Transformations or nonlinear models required.
Detection
Scatterplots, residual plots, component-plus-residual plots.
Correction
Polynomial terms, interaction terms, nonlinear regression techniques.
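To make the detection step concrete, here is a minimal NumPy sketch on simulated data (the quadratic relationship and coefficients are hypothetical, chosen only for illustration). A linear fit to curved data leaves a systematic pattern in the residuals; adding a polynomial term removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, x.size)  # true relationship is quadratic

# Residuals from a straight-line fit retain the missed curvature;
# residuals from a quadratic fit do not.
lin_resid = y - np.polyval(np.polyfit(x, y, 1), x)
quad_resid = y - np.polyval(np.polyfit(x, y, 2), x)

# Correlating residuals with x^2 quantifies the leftover pattern.
print(np.corrcoef(lin_resid, x**2)[0, 1])   # noticeably nonzero
print(np.corrcoef(quad_resid, x**2)[0, 1])  # essentially zero
```

In practice one would look at the residual-vs-fitted plot rather than a single correlation, but the same logic applies: structure left in the residuals signals a misspecified functional form.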
Independence
Definition
Observations and error terms are mutually independent. No correlation across residuals.
Implications
Dependence inflates Type I error rates and invalidates standard errors.
Detection
Durbin-Watson test, plots of residuals over time or observation order.
Correction
Time-series models (ARIMA), generalized least squares (GLS), clustered standard errors.
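The Durbin-Watson statistic is simple enough to compute by hand; a sketch on simulated errors (the AR(1) coefficient 0.8 is an arbitrary illustration):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals.
    Values near 2 suggest no first-order autocorrelation; near 0, positive
    autocorrelation; near 4, negative autocorrelation."""
    d = np.diff(resid)
    return np.sum(d**2) / np.sum(resid**2)

rng = np.random.default_rng(1)
iid = rng.normal(size=500)          # independent errors
ar1 = np.zeros(500)                 # AR(1) errors with rho = 0.8
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()

print(durbin_watson(iid))   # near 2
print(durbin_watson(ar1))   # well below 2
```

Since DW is approximately 2(1 - rho), the autocorrelated series lands near 0.4 here.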
Homoscedasticity
Definition
Constant variance of error terms across all levels of independent variables.
Implications
Heteroscedasticity reduces efficiency, biases standard errors, invalidates tests.
Detection
Breusch-Pagan test, White test, residual vs. fitted value plots.
Correction
Weighted least squares, robust standard errors, transforming dependent variable.
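The Breusch-Pagan idea — regress squared residuals on the predictors and test whether they explain anything — can be sketched in a few lines. This is a simplified version applied directly to simulated errors (in practice the residuals would come from a fitted regression):

```python
import numpy as np

def breusch_pagan_lm(x, resid):
    """LM = n * R^2 from regressing squared residuals on the predictor.
    Under homoscedasticity, LM ~ chi-square with 1 df
    (5% critical value about 3.84)."""
    n = resid.size
    X = np.column_stack([np.ones(n), x])
    u2 = resid**2
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((u2 - fitted)**2) / np.sum((u2 - u2.mean())**2)
    return n * r2

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 300)
e_const = rng.normal(0, 1, 300)     # constant variance
e_grow = rng.normal(0, 1, 300) * x  # variance grows with x

print(breusch_pagan_lm(x, e_const))  # small: consistent with homoscedasticity
print(breusch_pagan_lm(x, e_grow))   # large: reject constant variance
```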
Normality
Definition
Error terms are normally distributed with mean zero.
Implications
Ensures validity of t-tests, F-tests; non-normality affects inference but not coefficient estimates.
Detection
Q-Q plots, Shapiro-Wilk test, Kolmogorov-Smirnov test.
Correction
Transformations, bootstrapping, nonparametric methods.
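A quick Shapiro-Wilk check using SciPy, on simulated residuals (one normal sample, one deliberately skewed); a small p-value is evidence against normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=200)
skewed_resid = rng.exponential(size=200)  # clearly non-normal

# Shapiro-Wilk: null hypothesis is normality, so small p -> reject.
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)
print(p_normal, p_skewed)
```

For a visual check, the same residuals would go into a Q-Q plot; the skewed sample would bow away from the reference line in the tails.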
Multicollinearity
Definition
High correlation among independent variables causing redundancy.
Implications
Inflated variances of estimates, unstable coefficients, reduced interpretability.
Detection
Variance inflation factor (VIF), condition index, correlation matrix.
Correction
Remove or combine variables, principal component analysis, ridge regression.
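The VIF is just 1/(1 - R^2) from regressing each predictor on the others, so it can be computed directly; a NumPy sketch with simulated predictors, one of which is deliberately a near-copy of another:

```python
import numpy as np

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns plus an intercept. VIF > 10 is a common flag."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(X.shape[0]), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1 - np.sum((y - Z @ beta)**2) / np.sum((y - y.mean())**2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)              # independent of x1
x3 = x1 + 0.05 * rng.normal(size=200)  # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

print(vif(X, 1))  # near 1: x2 is unrelated to the others
print(vif(X, 2))  # very large: x3 is almost collinear with x1
```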
Model Specification
Definition
Correct inclusion of relevant predictors and functional form.
Implications
Mis-specification leads to biased, inconsistent estimators.
Detection
Ramsey RESET test, residual plots, omitted variable tests.
Correction
Add omitted variables, transform variables, check interaction terms.
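A simplified version of the Ramsey RESET test can be sketched by augmenting the model with squared fitted values and checking whether they improve the fit (the full test also adds cubed fitted values; this one-term variant is for illustration, on simulated data):

```python
import numpy as np

def reset_f(y, x):
    """Simplified RESET: add squared fitted values and compute the F statistic
    (1 numerator df; for large n the 5% critical value is about 3.84)."""
    n = y.size
    Xc = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    fitted = Xc @ beta
    ssr_restricted = np.sum((y - fitted)**2)

    Xa = np.column_stack([Xc, fitted**2])  # augmented model
    beta_a, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    ssr_unrestricted = np.sum((y - Xa @ beta_a)**2)

    df = n - Xa.shape[1]
    return (ssr_restricted - ssr_unrestricted) / (ssr_unrestricted / df)

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y_lin = 1 + 2 * x + rng.normal(size=300)           # correctly specified
y_quad = 1 + 2 * x + x**2 + rng.normal(size=300)   # omitted quadratic term

print(reset_f(y_lin, x))   # small: no evidence of misspecification
print(reset_f(y_quad, x))  # large: functional form is wrong
```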
Residual Analysis
Definition
Examination of differences between observed and predicted values (residuals).
Usage
Detect assumption violations, outliers, leverage points, influential observations.
Tools
Residual plots, standardized residuals, Cook’s distance, leverage statistics.
Interpretation
Patterns indicate model inadequacy or data issues.
Heteroscedasticity
Definition
Non-constant variance of error terms related to predictors or fitted values.
Consequences
Invalid standard errors, biased test statistics, inefficient OLS estimates.
Detection
Graphical residual analysis, Breusch-Pagan, White tests.
Remedies
Robust standard errors, weighted least squares, variance-stabilizing transformations.
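White (HC0) robust standard errors replace the single sigma-squared in the classical OLS variance formula with per-observation squared residuals. A NumPy sketch on simulated heteroscedastic data (variance growing with x; all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = np.abs(rng.normal(size=n)) + 0.1
y = 1 + 2 * x + rng.normal(size=n) * x  # error variance grows with x

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical OLS variance assumes one common sigma^2 ...
s2 = np.sum(resid**2) / (n - 2)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# ... while the HC0 "sandwich" uses each observation's squared residual.
meat = X.T @ (X * resid[:, None]**2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_classical[1], se_robust[1])
```

With variance increasing in x, the robust slope standard error comes out larger than the classical one, which is exactly the correction the remedy is meant to provide.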
Autocorrelation
Definition
Correlation of error terms across observations, often in time series data.
Effects
Biases standard errors, invalidates hypothesis tests, and yields inefficient estimates.
Detection
Durbin-Watson test, Ljung-Box test, autocorrelation function (ACF) plots.
Correction
Generalized least squares, autoregressive models, Newey-West standard errors.
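The ACF mentioned above is easy to compute directly; a sketch on a simulated AR(1) error series (rho = 0.7 is an arbitrary illustration). For an AR(1) process the autocorrelation decays geometrically, rho^lag:

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = x - x.mean()
    return np.sum(x[lag:] * x[:-lag]) / np.sum(x**2)

rng = np.random.default_rng(7)
n = 1000
ar1 = np.zeros(n)                  # AR(1) errors with rho = 0.7
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + rng.normal()

print(acf(ar1, 1))  # near 0.7
print(acf(ar1, 2))  # near 0.49, i.e. 0.7^2
```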
OLS Assumptions Summary
Classical Linear Regression Model Assumptions
1. Linearity: Y = Xβ + ε, linear in parameters
2. Independence: Cov(ε_i, ε_j) = 0 for i ≠ j
3. Homoscedasticity: Var(ε_i) = σ² constant
4. Normality: ε_i ~ N(0, σ²)
5. No perfect multicollinearity among X variables
6. Exogeneity: E[ε_i|X] = 0
Consequences of Violation
Unbiasedness, consistency, efficiency, valid inference compromised.
Practical Notes
Large samples reduce normality concerns (CLT). Some violations correctable with robust methods.
| Assumption | Effect if Violated | Detection Method |
|---|---|---|
| Linearity | Biased estimates | Residual plots |
| Independence | Invalid SEs, tests | Durbin-Watson |
| Homoscedasticity | Inefficient estimates | Breusch-Pagan |
| Normality | Invalid inference | Q-Q plot, Shapiro-Wilk |
| Multicollinearity | Unstable estimates | VIF, correlations |
Regression Diagnostics
Purpose
Identify assumption violations, outliers, leverage and influence points.
Common Techniques
Residual plots, leverage statistics, Cook’s distance, DFBETAS, Hat matrix diagonals.
Interpretation
Large residuals: outliers. High leverage: extreme predictor values with potential influence. High Cook’s distance: influential observations.
Example Diagnostics Table
| Diagnostic | Description | Threshold |
|---|---|---|
| Standardized Residual | Residual scaled by its SD | Absolute value > 3 indicates outlier |
| Leverage (Hat value) | Influence potential of X value | >2p/n considered high |
| Cook’s Distance | Combined influence measure | >1 suggests influential point |
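The diagnostics in the table can all be read off the hat matrix. A NumPy sketch with simulated data and one deliberately planted high-leverage outlier (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
x[0] = 8.0                          # one high-leverage point ...
y = 1 + 2 * x + rng.normal(size=n)
y[0] += 15                          # ... that is also an outlier

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
leverage = np.diag(H)                 # hat values; threshold 2p/n
resid = y - H @ y
s2 = np.sum(resid**2) / (n - p)       # residual variance estimate

# Cook's distance combines residual size with leverage.
cooks = resid**2 / (p * s2) * leverage / (1 - leverage)**2

print(leverage[0] > 2 * p / n)    # flagged as high leverage
print(cooks[0] == cooks.max())    # also the most influential observation
```

A useful sanity check: the hat values always sum to p, the number of model parameters, so the average leverage is p/n and the 2p/n rule flags points at twice the average.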
Remediation
Investigate data errors, consider transformations, robust regression, or model refinement.