Introduction
Residual analysis is a critical step in regression analysis. It involves examining the differences between observed and predicted values to assess model adequacy. Residuals reveal model fit, assumption violations, and data anomalies. Effective residual analysis improves prediction accuracy, identifies outliers, and guides transformations or model choice.
"Residuals are the raw material of diagnostics in regression analysis: without them, no model assessment is possible." -- John Fox
Definition of Residuals
Mathematical Expression
Residual (ei): difference between observed value (yi) and predicted value (Ŷi).
ei = yi - Ŷi
Interpretation
Magnitude indicates prediction error. Sign shows direction: positive residuals indicate underprediction, negative indicate overprediction.
Role in Regression
Residuals quantify unexplained variation after fitting the regression model. They are central to model diagnostics.
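The definition can be made concrete with a short, self-contained Python sketch; the data values below are illustrative, not from the text:

```python
# Minimal sketch: fit a one-predictor least-squares line and inspect residuals.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # e_i = y_i - yhat_i

# A positive residual means the model underpredicted that observation.
for xi, ei in zip(x, residuals):
    print(f"x={xi}: residual={ei:+.3f}")
```

Note that with an intercept in the model, the residuals sum to zero by construction, so their sign pattern, not their average, carries the diagnostic information.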
Purpose of Residual Analysis
Model Validation
Checks whether the model assumptions hold: linearity, homoscedasticity, normality, and independence of errors.
Detecting Anomalies
Identifies outliers, leverage points, and influential data that distort model estimates.
Improving Model Fit
Guides transformations or alternative model selection by revealing systematic patterns in residuals.
Types of Residuals
Raw Residuals
Basic difference between observed and predicted values. Sensitive to scale.
Standardized Residuals
Raw residuals divided by their estimated standard deviation. Facilitates comparison.
ri = ei / (s √(1 - hii))
Studentized Residuals
Standardized residuals with variance estimated excluding observation i. More robust for outlier detection.
Deleted Residuals
Residuals computed after removing observation i. Used to assess influence.
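These residual types can be computed directly. Below is an illustrative pure-Python sketch for a one-predictor model, using the leverage hii = 1/n + (xi - x̄)²/Sxx; the data are made up, with a deliberate outlier in the last observation:

```python
import math

# Raw, standardized, externally studentized, and deleted residuals
# for a simple (one-predictor) regression.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 4.1, 4.8, 9.0]   # last point is a deliberate outlier
n, p = len(x), 2                      # p = number of estimated coefficients

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]     # raw residuals
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]     # leverages h_ii
s2 = sum(ei ** 2 for ei in e) / (n - p)               # MSE

standardized = [ei / math.sqrt(s2 * (1 - hi)) for ei, hi in zip(e, h)]
# Externally studentized: error variance re-estimated without observation i.
studentized = [
    ei / math.sqrt(((n - p) * s2 - ei ** 2 / (1 - hi)) / (n - p - 1) * (1 - hi))
    for ei, hi in zip(e, h)
]
deleted = [ei / (1 - hi) for ei, hi in zip(e, h)]     # predicted residuals

print("standardized:", [round(r, 2) for r in standardized])
print("studentized: ", [round(t, 2) for t in studentized])
```

Note how the outlier inflates the overall MSE, which shrinks its own standardized residual; the studentized version, which excludes observation i from the variance estimate, flags it much more sharply.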
Regression Assumptions and Residuals
Linearity
Residuals should show no systematic pattern versus predictors or fitted values.
Independence
Residuals must be uncorrelated, especially in time series or clustered data.
Homoscedasticity
Constant variance of residuals across all levels of predictors.
Normality
Residuals should approximate a normal distribution for valid inference.
Diagnostic Plots
Residuals vs Fitted Values
Detects non-linearity, unequal error variance, and outliers.
Normal Q-Q Plot
Assesses normality of residuals by comparing with theoretical quantiles.
Scale-Location Plot
Plots square root of standardized residuals against fitted values to detect heteroscedasticity.
Residuals vs Leverage
Identifies influential points combining residual size and leverage magnitude.
| Plot | Purpose | Typical Pattern |
|---|---|---|
| Residuals vs Fitted | Check linearity, homoscedasticity | Random scatter indicates good fit |
| Normal Q-Q | Assess residual normality | Points on line indicate normality |
| Scale-Location | Detect heteroscedasticity | Horizontal line preferred |
| Residuals vs Leverage | Identify influential points | Points outside Cook’s distance bands flagged |
Detecting Outliers and Influential Points
Outliers
Observations with large residuals indicating poor fit or data errors.
Leverage
Measures the influence of an observation’s predictor values on its own fitted value. High-leverage points can disproportionately affect the model.
Cook’s Distance
Combines residual size and leverage to quantify influence of each point.
Di = (ei² / (p · MSE)) · (hii / (1 - hii)²)
Thresholds
Standardized residuals with absolute value above 2 (or 3, for a more conservative rule) flag potential outliers. Cook’s distance above 4/n suggests an influential point.
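As an illustration, the Cook’s distance formula and the 4/n rule can be applied in a few lines of pure Python; the data are invented, with an influential final observation:

```python
# Cook's distance D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2
# for a one-predictor regression.

def cooks_distances(x, y):
    n, p = len(x), 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]   # leverages
    mse = sum(ei ** 2 for ei in e) / (n - p)
    return [ei ** 2 / (p * mse) * hi / (1 - hi) ** 2 for ei, hi in zip(e, h)]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.1, 8.0, 9.2, 16.0]  # last point influential

d = cooks_distances(x, y)
threshold = 4 / len(x)
flagged = [i for i, di in enumerate(d) if di > threshold]
print("Cook's distances:", [round(di, 3) for di in d])
print("flagged (D_i > 4/n):", flagged)
```

The final observation combines a large residual with high leverage, so its Cook’s distance dominates the rest of the sample.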
Heteroscedasticity Detection
Definition
Non-constant variance of residuals violating homoscedasticity assumption.
Visual Detection
Residuals vs fitted plot shows funnel shape or systematic spread changes.
Statistical Tests
Breusch-Pagan and White tests detect heteroscedasticity formally.
Breusch-Pagan test:
H0: homoscedasticity
Test statistic = n · R² from auxiliary regression of squared residuals on predictors
Compare with χ² distribution
Consequences
Invalid standard errors, confidence intervals, and hypothesis tests if ignored.
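A minimal sketch of the Breusch-Pagan procedure for the one-predictor case: the auxiliary regression has a single regressor, so the statistic is compared with a χ² distribution on 1 degree of freedom, whose survival function reduces to erfc(√(BP/2)). The data are synthetic, with residual spread growing in x:

```python
import math

def simple_ols_r2(x, y):
    """R^2 of a one-predictor least-squares fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return (sxy ** 2) / (sxx * syy)

def breusch_pagan(x, residuals):
    """BP = n * R^2 from regressing squared residuals on x (1 df)."""
    n = len(x)
    e2 = [ei ** 2 for ei in residuals]
    bp = n * simple_ols_r2(x, e2)
    p_value = math.erfc(math.sqrt(bp / 2))  # chi-square survival, 1 df
    return bp, p_value

x = list(range(1, 21))
# Residuals whose spread grows with x: a textbook funnel pattern.
residuals = [(-1) ** i * 0.1 * xi for i, xi in enumerate(x)]
bp, p = breusch_pagan(x, residuals)
print(f"BP = {bp:.2f}, p = {p:.2g}")
```

The small p-value leads to rejecting H0 of homoscedasticity, consistent with the funnel shape built into the synthetic residuals.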
Testing Residual Normality
Importance
Normality assumption underpins inference validity in OLS regression.
Visual Methods
Normal Q-Q plots reveal deviations from normality by plotting ordered residuals against theoretical quantiles.
Formal Tests
Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling tests assess normality quantitatively.
Interpretation
Significant test results indicate residuals deviate from normality, requiring remedial measures.
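As an illustration, the Shapiro-Wilk test can be run via SciPy’s `scipy.stats.shapiro` (this assumes SciPy is available; the two residual samples below are synthetic):

```python
import random
from scipy import stats  # assumes SciPy is installed

# Compare a normal residual sample with a heavily right-skewed one.
random.seed(42)
normal_resid = [random.gauss(0.0, 1.0) for _ in range(200)]
skewed_resid = [random.expovariate(1.0) for _ in range(200)]

w_norm, p_norm = stats.shapiro(normal_resid)
w_skew, p_skew = stats.shapiro(skewed_resid)

print(f"normal sample: W={w_norm:.3f}, p={p_norm:.3f}")
print(f"skewed sample: W={w_skew:.3f}, p={p_skew:.3g}")
# A small p-value (e.g. < 0.05) indicates departure from normality.
```

The W statistic is close to 1 for the normal sample; the skewed sample yields a much smaller W and a p-value far below 0.05, triggering the remedial measures discussed below.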
Remedies and Model Improvement
Transformations
Apply log, square root, Box-Cox to stabilize variance and linearize relationships.
Robust Regression
Methods less sensitive to outliers: M-estimators, RANSAC, quantile regression.
Weighted Least Squares
Weights observations in inverse proportion to their error variance to correct heteroscedasticity.
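A minimal WLS sketch for one predictor, assuming for illustration that the error variance is proportional to xi², so the weights are wi = 1/xi² (the data are invented):

```python
def weighted_least_squares(x, y, w):
    """Simple WLS fit: minimize sum of w_i * (y_i - b0 - b1*x_i)^2."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    b1 = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y)) \
        / sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    b0 = y_w - b1 * x_w
    return b0, b1

# Variance assumed proportional to x^2, so weights w_i = 1 / x_i^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.8, 8.4, 9.5]
w = [1 / xi ** 2 for xi in x]

b0, b1 = weighted_least_squares(x, y, w)
print(f"WLS fit: intercept={b0:.3f}, slope={b1:.3f}")
```

With equal weights the formulas reduce exactly to ordinary least squares, which is a useful sanity check on any WLS implementation.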
Model Respecification
Add interaction terms, polynomial predictors, or switch to non-linear models based on residual patterns.
Statistical Formulas
Residual Calculation
ei = yi - Ŷi
Standardized Residual
ri = ei / (s √(1 - hii))
Cook’s Distance
Di = (ei² / (p · MSE)) · (hii / (1 - hii)²)
Breusch-Pagan Test Statistic
BP = n · R² (from regression of ei² on X)
Practical Examples
Example 1: Residual Plot Interpretation
Data: Simple linear regression of sales on advertising spend.
Residuals vs fitted plot shows curved pattern indicating non-linearity. Remedy: add quadratic term.
Example 2: Outlier Detection
Dataset: House prices with one observation having a large standardized residual (3.5) and Cook’s distance (0.12) above threshold (4/n = 0.08).
Action: Investigate data collection for errors, consider robust regression.
| Observation | Standardized Residual | Cook’s Distance | Action |
|---|---|---|---|
| 27 | 3.5 | 0.12 | Investigate, possible removal |
| 15 | -0.8 | 0.02 | No action needed |
Example 3: Heteroscedasticity Test
Performed Breusch-Pagan test on residuals from income vs education regression. Test statistic = 9.8, p-value = 0.002.
Conclusion: Reject homoscedasticity; apply weighted least squares.
References
- Fox, J. "Applied Regression Analysis and Generalized Linear Models." Sage Publications, 2015, pp. 123-145.
- Montgomery, D.C., Peck, E.A., Vining, G.G. "Introduction to Linear Regression Analysis." 5th ed., Wiley, 2012, pp. 78-110.
- Cook, R.D. "Detection of Influential Observations in Linear Regression." Technometrics, vol. 19, no. 1, 1977, pp. 15-18.
- Breusch, T.S., Pagan, A.R. "A Simple Test for Heteroscedasticity and Random Coefficient Variation." Econometrica, vol. 47, no. 5, 1979, pp. 1287-1294.
- Shapiro, S.S., Wilk, M.B. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, vol. 52, no. 3-4, 1965, pp. 591-611.