Introduction

Residual analysis is a critical step in regression analysis. It involves examining the differences between observed and predicted values to assess model adequacy. Residuals reveal model fit, assumption violations, and data anomalies. Effective residual analysis improves prediction accuracy, identifies outliers, and guides transformations or model choice.

"Residuals are the raw material of diagnostics in regression analysis: without them, no model assessment is possible." -- John Fox

Definition of Residuals

Mathematical Expression

Residual (ei): the difference between the observed value (yi) and the fitted value (ŷi).

ei = yi - ŷi

Interpretation

Magnitude indicates the size of the prediction error. Sign shows direction: a positive residual means the model underpredicted, a negative one means it overpredicted.

Role in Regression

Residuals quantify unexplained variation after fitting the regression model. They are central to model diagnostics.
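
A minimal sketch of computing residuals, assuming a simple one-predictor least-squares fit; the data and variable names are illustrative.

```python
# Minimal sketch: fit a simple least-squares line and compute raw
# residuals e_i = y_i - yhat_i. Data are illustrative.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least-squares slope and intercept for one predictor
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

When an intercept is included, the residuals sum to (numerically) zero, which is a quick sanity check on any fitting code.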

Purpose of Residual Analysis

Model Validation

Checks if model assumptions hold. Validates linearity, homoscedasticity, normality, and independence.

Detecting Anomalies

Identifies outliers, leverage points, and influential data that distort model estimates.

Improving Model Fit

Guides transformations or alternative model selection by revealing systematic patterns in residuals.

Types of Residuals

Raw Residuals

Basic difference between observed and predicted values. Sensitive to scale.

Standardized Residuals

Raw residuals divided by their estimated standard deviation, putting residuals on a common scale for comparison across observations.

ri = ei / (s √(1 - hii))

Studentized Residuals

Standardized residuals with variance estimated excluding observation i. More robust for outlier detection.

Deleted Residuals

Residuals computed after removing observation i. Used to assess influence.
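
The variants above have closed forms for a simple linear regression (p = 2 parameters), where the leverage is hii = 1/n + (xi - x̄)²/Sxx. A pure-Python sketch with illustrative data:

```python
import math

# Illustrative sketch: leverage, standardized, externally studentized,
# and deleted residuals for a simple linear regression (p = 2).
x = [1.0, 2.0, 3.0, 4.0, 10.0]   # last point has high leverage
y = [2.0, 4.1, 5.9, 8.2, 19.5]

n, p = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in resid)
s2 = sse / (n - p)                                 # MSE

# Leverage for simple regression: h_ii = 1/n + (x_i - x_bar)^2 / Sxx
lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

standardized = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, lev)]
deleted = [e / (1 - h) for e, h in zip(resid, lev)]    # PRESS residuals
# Externally studentized: variance re-estimated without observation i
studentized = [
    e / math.sqrt(((sse - e ** 2 / (1 - h)) / (n - p - 1)) * (1 - h))
    for e, h in zip(resid, lev)
]
```

A useful check: the leverages always sum to p, the number of fitted parameters.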

Regression Assumptions and Residuals

Linearity

Residuals should show no systematic pattern versus predictors or fitted values.

Independence

Residuals must be uncorrelated, especially in time series or clustered data.
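
One common numeric check for first-order serial correlation, not named in the text above, is a Durbin-Watson-style statistic; values near 2 suggest no autocorrelation, values near 0 or 4 suggest positive or negative autocorrelation. A sketch with made-up residuals:

```python
# Durbin-Watson-style statistic on an illustrative residual sequence
# that drifts (suggesting positive autocorrelation).
resid = [0.5, 0.4, 0.6, 0.3, -0.2, -0.4, -0.5, -0.3, 0.1, 0.4]

dw = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid))) \
     / sum(e ** 2 for e in resid)
# dw well below 2 here, consistent with the drifting pattern
```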

Homoscedasticity

Constant variance of residuals across all levels of predictors.

Normality

Residuals should approximate normal distribution for valid inference.

Diagnostic Plots

Residuals vs Fitted Values

Detects non-linearity, unequal error variance, and outliers.

Normal Q-Q Plot

Assesses normality of residuals by comparing with theoretical quantiles.

Scale-Location Plot

Plots square root of standardized residuals against fitted values to detect heteroscedasticity.

Residuals vs Leverage

Identifies influential points combining residual size and leverage magnitude.

Plot | Purpose | Typical Pattern
Residuals vs Fitted | Check linearity, homoscedasticity | Random scatter indicates good fit
Normal Q-Q | Assess residual normality | Points on line indicate normality
Scale-Location | Detect heteroscedasticity | Horizontal trend preferred
Residuals vs Leverage | Identify influential points | Points outside Cook’s distance bands flagged

Detecting Outliers and Influential Points

Outliers

Observations with large residuals indicating poor fit or data errors.

Leverage

Measures influence of an observation’s predictor values on fitted values. High leverage points can disproportionately affect model.

Cook’s Distance

Combines residual size and leverage to quantify influence of each point.

Di = (ei² / (p · MSE)) · (hii / (1 - hii)²)

Thresholds

Standardized residuals with absolute value above 2 (or, more conservatively, 3) flag potential outliers. Cook’s distance above 4/n suggests an influential point.
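
The Cook’s distance formula above can be evaluated directly for a simple linear regression; a pure-Python sketch with illustrative data containing one unusual point:

```python
# Sketch: Cook's distance D_i for a simple linear regression (p = 2),
# flagging points with D_i > 4/n. Data are illustrative.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [1.9, 4.2, 6.1, 7.8, 10.2, 11.9, 35.0]   # last point is anomalous

n, p = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in resid) / (n - p)
lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# D_i = (e_i^2 / (p * MSE)) * (h_ii / (1 - h_ii)^2)
cooks = [(e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
         for e, h in zip(resid, lev)]
flagged = [i for i, d in enumerate(cooks) if d > 4 / n]
```

Note that the anomalous point pulls the fitted line toward itself, so its raw residual is modest; it is the combination with high leverage that makes its Cook’s distance large.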

Heteroscedasticity Detection

Definition

Non-constant variance of residuals violating homoscedasticity assumption.

Visual Detection

Residuals vs fitted plot shows funnel shape or systematic spread changes.

Statistical Tests

Breusch-Pagan and White tests detect heteroscedasticity formally.

Breusch-Pagan test:
  • H0: homoscedasticity
  • Test statistic: BP = n · R², where R² comes from the auxiliary regression of the squared residuals on the predictors
  • Compare with the χ² distribution (degrees of freedom equal to the number of predictors)
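
For a single predictor the Breusch-Pagan statistic reduces to n times the R² of regressing the squared residuals on x. A pure-Python sketch with illustrative data constructed so the spread grows with x:

```python
# Sketch of the Breusch-Pagan statistic for one predictor:
# BP = n * R^2 from the auxiliary regression of e_i^2 on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.8, 6.5, 7.2, 11.0, 10.1, 16.3, 13.9, 20.6, 17.4]

def ols_resid(x, y):
    """Residuals from a simple one-predictor least-squares fit."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    a = yb - b * xb
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def r_squared(x, y):
    """R^2 of a simple one-predictor regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    syy = sum((yi - yb) ** 2 for yi in y)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

e2 = [e ** 2 for e in ols_resid(x, y)]
bp = len(x) * r_squared(x, e2)   # compare with chi-square, df = 1
```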

Consequences

Invalid standard errors, confidence intervals, and hypothesis tests if ignored.

Testing Residual Normality

Importance

Normality assumption underpins inference validity in OLS regression.

Visual Methods

Normal Q-Q plots reveal deviations from normality by plotting ordered residuals against theoretical quantiles.
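
The computation behind a Q-Q plot needs only the normal inverse CDF, which the Python standard library provides via statistics.NormalDist; a sketch with illustrative residual values:

```python
from statistics import NormalDist

# Sketch: ordered residuals paired with theoretical normal quantiles
# at probability points (i - 0.5) / n. Residual values are made up.
resid = [-1.3, 0.2, 0.5, -0.7, 1.1, -0.1, 0.9, -0.4, 0.3, -0.5]

n = len(resid)
sample_q = sorted(resid)
theor_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Near-normal residuals give (theor_q, sample_q) pairs close to a
# straight line with slope roughly the residual standard deviation.
pairs = list(zip(theor_q, sample_q))
```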

Formal Tests

Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling tests assess normality quantitatively.

Interpretation

Significant test results indicate residuals deviate from normality, requiring remedial measures.

Remedies and Model Improvement

Transformations

Apply log, square root, Box-Cox to stabilize variance and linearize relationships.

Robust Regression

Methods less sensitive to outliers: M-estimators, RANSAC, quantile regression.

Weighted Least Squares

Weights observations inversely proportional to variance to correct heteroscedasticity.
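
A minimal weighted-least-squares sketch for one predictor, using the closed-form weighted normal equations. The weight choice w_i = 1/x_i² is an illustrative assumption, appropriate when residual spread grows proportionally to x:

```python
import math

# Sketch: weighted least squares for one predictor; data illustrative.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.2, 3.8, 6.5, 7.6, 10.9, 11.3, 14.8, 15.2]

w = [1 / xi ** 2 for xi in x]      # assumed inverse-variance weights

# Weighted sums for the closed-form normal-equation solution
sw = sum(w)
swx = sum(wi * xi for wi, xi in zip(w, x))
swy = sum(wi * yi for wi, yi in zip(w, y))
swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
intercept = (swy - slope * swx) / sw

# Weighted residuals sqrt(w_i) * e_i should show roughly constant spread
wresid = [math.sqrt(wi) * (yi - (intercept + slope * xi))
          for wi, xi, yi in zip(w, x, y)]
```

By construction the solution satisfies the weighted normal equations, so the weighted residuals are orthogonal to both the intercept and x.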

Model Respecification

Add interaction terms, polynomial predictors, or switch to non-linear models based on residual patterns.

Statistical Formulas

Residual Calculation

ei = yi - ŷi

Standardized Residual

ri = ei / (s √(1 - hii))

Cook’s Distance

Di = (ei² / (p · MSE)) · (hii / (1 - hii)²)

Breusch-Pagan Test Statistic

BP = n · R² (from the auxiliary regression of ei² on X)

Practical Examples

Example 1: Residual Plot Interpretation

Data: Simple linear regression of sales on advertising spend.

Residuals vs fitted plot shows curved pattern indicating non-linearity. Remedy: add quadratic term.

Example 2: Outlier Detection

Dataset: House prices with one observation having a large standardized residual (3.5) and Cook’s distance (0.12) above threshold (4/n = 0.08).

Action: Investigate data collection for errors, consider robust regression.

Observation | Standardized Residual | Cook’s Distance | Action
27 | 3.5 | 0.12 | Investigate, possible removal
15 | -0.8 | 0.02 | No action needed

Example 3: Heteroscedasticity Test

Performed Breusch-Pagan test on residuals from income vs education regression. Test statistic = 9.8, p-value = 0.002.

Conclusion: Reject homoscedasticity; apply weighted least squares.

References

  • Fox, J. "Applied Regression Analysis and Generalized Linear Models." Sage Publications, 2015, pp. 123-145.
  • Montgomery, D.C., Peck, E.A., Vining, G.G. "Introduction to Linear Regression Analysis." 5th ed., Wiley, 2012, pp. 78-110.
  • Cook, R.D. "Detection of Influential Observations in Linear Regression." Technometrics, vol. 19, no. 1, 1977, pp. 15-18.
  • Breusch, T.S., Pagan, A.R. "A Simple Test for Heteroscedasticity and Random Coefficient Variation." Econometrica, vol. 47, no. 5, 1979, pp. 1287-1294.
  • Shapiro, S.S., Wilk, M.B. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, vol. 52, no. 3-4, 1965, pp. 591-611.